Abstract

Privacy-preserving machine learning algorithms are crucial for the increasingly common setting
in which personal data, such as medical or financial records, are analyzed. We provide
general techniques to produce privacy-preserving approximations of classifiers learned via
(regularized) empirical risk minimization (ERM). These algorithms are private under the
$\epsilon$-differential privacy definition due to Dwork et al. (2006). First we apply the output
perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method,
objective perturbation, for privacy-preserving machine learning algorithm design. This method
entails perturbing the objective function before optimizing over classifiers. If the loss and
regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results
showing that our algorithms preserve privacy, and provide generalization bounds for linear and
nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in
general machine learning algorithms, thereby providing end-to-end privacy guarantees for the
training process. We apply these results to produce privacy-preserving analogues of regularized
logistic regression and support vector machines. We obtain encouraging results from evaluating
their performance on real demographic and benchmark data sets. Our results show that both
theoretically and empirically, objective perturbation is superior to the previous state-of-the-art,
output perturbation, in managing the inherent tradeoff between privacy and learning performance.

1 Introduction



Privacy has become a growing concern, due to the massive increase in personal information stored
in electronic databases, such as medical records, financial records, web search histories, and social
network data. Machine learning can be employed to discover novel population-wide patterns;
however, the results of such algorithms may reveal certain individuals' sensitive information, thereby
violating their privacy. Thus, an emerging challenge for machine learning is how to learn from
datasets that contain sensitive personal information.

At first glance, it may appear that simple anonymization of private information is enough to
preserve privacy. However, this is often not the case; even if obvious identifiers, such as names and
addresses, are removed from the data, the remaining fields can still form unique "signatures" that
can help re-identify individuals. Such attacks have been demonstrated by various works, and are
[...] dataset may not be sufficient to preserve privacy, as illustrated on genetic data (Homer et al., 2008;
Wang et al., 2009). Thus, there is a great need for designing machine learning algorithms that also
preserve the privacy of individuals in the datasets on which they train and operate.

In this paper we focus on the problem of classification, one of the fundamental problems of 
machine learning, when the training data consists of sensitive information of individuals. Our work 
addresses the empirical risk minimization (ERM) framework for classification, in which a classifier
is chosen by minimizing the average over the training data of the prediction loss (with respect
to the label) of the classifier in predicting each training data point. In this work, we focus on
regularized ERM in which there is an additional term in the optimization, called the regularizer,
which penalizes the complexity of the classifier with respect to some metric. Regularized ERM
methods are widely used in practice, for example in logistic regression and support vector machines
(SVMs), and many also have theoretical justification in the form of generalization error bounds
with respect to independently, identically distributed (i.i.d.) data (see Vapnik (1998) for further
details).

For our privacy measure, we use a definition due to Dwork et al. (2006b), who have proposed a
measure for quantifying the privacy risk associated with computing functions of sensitive data. Their
$\epsilon$-differential privacy model is a strong, cryptographically-motivated definition of privacy that has
recently received a significant amount of research attention for its robustness to known attacks, such
as those involving side information (Ganta et al., 2008). Algorithms satisfying $\epsilon$-differential privacy
are randomized; the output is a random variable whose distribution is conditioned on the data set.
A statistical procedure satisfies $\epsilon$-differential privacy if changing a single data point does not shift
the output distribution by too much. Therefore, from looking at the output of the algorithm, it is 
difficult to infer the value of any particular data point. 

In this paper, we develop methods for approximating ERM while guaranteeing $\epsilon$-differential
privacy. Our results hold for loss functions and regularizers satisfying certain differentiability
and convexity conditions. An important aspect of our work is that we develop methods for
end-to-end privacy; each step in the learning process can cause additional risk of privacy violation,
and we provide algorithms with quantifiable privacy guarantees for training as well as parameter
tuning. For training, we provide two privacy-preserving approximations to ERM. The first is output
perturbation, based on the sensitivity method proposed by Dwork et al. (2006b). In this method
noise is added to the output of the standard ERM algorithm. The second method is novel, and
involves adding noise to the regularized ERM objective function prior to minimizing. We call
this second method objective perturbation. We show theoretical bounds for both procedures; the
theoretical performance of objective perturbation is superior to that of output perturbation for
most problems. However, for our results to hold we require that the regularizer be strongly convex
(ruling out $L_1$ regularizers) and that the loss function and its derivatives satisfy additional
constraints. In practice, these additional constraints do not affect the performance of the resulting
classifier; we validate our theoretical results on data sets from the UCI repository.

In practice, parameters in learning algorithms are chosen via a holdout data set. In the context 
of privacy, we must guarantee the privacy of the holdout data as well. We exploit results from
the theory of differential privacy to develop a privacy-preserving parameter tuning algorithm, and
demonstrate its use in practice. Together with our training algorithms, this parameter tuning 
algorithm guarantees privacy to all data used in the learning process. 

Guaranteeing privacy incurs a cost in performance; because the algorithms must cause some 
uncertainty in the output, they increase the loss of the output predictor. Because the $\epsilon$-differential
privacy model requires robustness against all data sets, we make no assumptions on the underlying 
data for the purposes of making privacy guarantees. However, to prove the impact of privacy 
constraints on the generalization error, we assume the data is i.i.d. according to a fixed but unknown 
distribution, as is standard in the machine learning literature. Although many of our results hold 
for ERM in general, we provide specific results for classification using logistic regression and support
vector machines. Some of the former results were reported in Chaudhuri and Monteleoni (2008);
here we generalize them to ERM and extend the results to kernel methods, and provide experiments 
on real datasets. 

More specifically, the contributions of this paper are as follows: 

• We derive a computationally efficient algorithm for ERM classification, based on the
  sensitivity method due to Dwork et al. (2006b). We analyze the accuracy of this algorithm, and
  provide an upper bound on the number of training samples required by this algorithm to
  achieve a fixed generalization error.

• We provide a general technique, objective perturbation, for providing computationally efficient,
  differentially private approximations to regularized ERM algorithms. This extends the work of
  Chaudhuri and Monteleoni (2008), which follows as a special case, and corrects an error in the
  arguments made there. We apply the general results on the sensitivity method and objective
  perturbation to logistic regression and support vector machine classifiers. In addition to
  privacy guarantees, we also provide generalization bounds for this algorithm.

• For kernel methods with nonlinear kernel functions, the optimal classifier is a linear
  combination of kernel functions centered at the training points. This form is inherently non-private
  because it reveals the training data. We adapt a random projection method due to Rahimi
  and Recht (Rahimi and Recht, 2007, 2008b) to develop privacy-preserving kernel-ERM
  algorithms. We provide theoretical results on generalization performance.

• Because the holdout data is used in the process of training and releasing a classifier, we
  provide a privacy-preserving parameter tuning algorithm based on a randomized selection
  procedure (McSherry and Talwar, 2007) applicable to general machine learning algorithms.
  This guarantees end-to-end privacy during the learning procedure.

• We validate our results using experiments on two datasets from the UCI Machine Learning
  repositories (Asuncion and Newman, 2007) and KDDCup (Hettich and Bay, 1999). Our
  results show that objective perturbation is generally superior to output perturbation. We
  also demonstrate the impact of end-to-end privacy on generalization error.



1.1 Related Work 

There has been a significant amount of literature on the ineffectiveness of simple anonymization
procedures. For example, Narayanan and Shmatikov (2008) show that a small amount of auxiliary
information (knowledge of a few movie-ratings, and approximate dates) is sufficient for an adversary
to re-identify an individual in the Netflix dataset, which consists of anonymized data about Netflix
users and their movie ratings. The same phenomenon has been observed in other kinds of data, such
as social network graphs (Backstrom et al., 2007), search query logs (Jones et al., 2007) and others.
Releasing statistics computed on sensitive data can also be problematic; for example, Wang et al.
(2009) show that releasing $R^2$-values computed on high-dimensional genetic data can lead to privacy
breaches by an adversary who is armed with a small amount of auxiliary information.



There has also been a significant amount of work on privacy-preserving data mining (Agrawal and
Srikant, 2000; Evfimievski et al., 2003; Sweeney, 2002; Machanavajjhala et al., 2006), spanning several
communities, that uses privacy models other than differential privacy. Many of the models used have
been shown to be susceptible to composition attacks, attacks in which the adversary has some
reasonable amount of prior knowledge (Ganta et al., 2008). Other work (Mangasarian et al., 2008)
considers the problem of privacy-preserving SVM classification when separate agents have to share
private data, and provides a solution that uses random kernels, but does not provide any formal
privacy guarantee.

An alternative line of privacy work is in the Secure Multiparty Computation setting due to
Yao (1982), where the sensitive data is split across multiple hostile databases, and the goal is to
compute a function on the union of these databases. Zhan and Matwin (2007) and Laur et al.
(2006) consider computing privacy-preserving SVMs in this setting, and their goal is to design a
distributed protocol to learn a classifier. This is in contrast with our work, which deals with a
setting where the algorithm has access to the entire dataset.

[...] 2008). Unlike many other privacy definitions, such as those mentioned above, differential
privacy has been shown to be resistant to composition attacks (attacks involving side-information)
(Ganta et al., 2008). Some follow-up work on differential privacy includes work on
differentially-private combinatorial optimization, due to Gupta et al. (2010), and differentially-private
contingency tables, due to Barak et al. (2007) and Kasiviswanathan et al. (2010). Wasserman and
Zhou (2010) provide a more statistical view of differential privacy, and Zhou et al. (2009) provide
a technique of generating synthetic data using compression via random linear or affine
transformations.

Previous literature has also considered learning with differential privacy. One of the first such
works is Kasiviswanathan et al. (2008), which presents a general, although computationally
inefficient, method for PAC-learning finite concept classes. Blum et al. (2008) presents a method for
releasing a database in a differentially-private manner, so that certain fixed classes of queries can
be answered accurately, provided the class of queries has a bounded VC-dimension. Their methods
can also be used to learn classifiers with a fixed VC-dimension (see Kasiviswanathan et al. (2008));
however, the resulting algorithm is also computationally inefficient. Some sample complexity lower
bounds in this setting have been provided by Beimel et al. (2010). In addition, Dwork and Lei
(2009) explore a connection between differential privacy and robust statistics, and provide an
algorithm for privacy-preserving regression using ideas from robust statistics. However, their algorithm
also requires a running time which is exponential in the data dimension, and is hence
computationally inefficient.

This work builds on our preliminary work in Chaudhuri and Monteleoni (2008). We first show
how to extend the sensitivity method, a form of output perturbation, due to Dwork et al. (2006b),
to classification algorithms. In general, output perturbation methods alter the output of the
function computed on the database, before releasing it; in particular the sensitivity method makes
an algorithm differentially private by adding noise to its output. In the classification setting, the
noise protects the privacy of the training data, but increases the prediction error of the classifier.
Recently, independent work by Rubinstein et al. (2009) has reported an extension of the sensitivity
method to linear and kernel SVMs. Their utility analysis differs from ours, and thus the
analogous generalization bounds are not comparable. Because Rubinstein et al. use techniques from
algorithmic stability, their utility bounds compare the private and non-private classifiers using the
same value for the regularization parameter. In contrast, our approach takes into account how
the value of the regularization parameter might change due to privacy constraints. As an alternative
to output perturbation, we propose the objective perturbation method, in which noise is added to
the objective function before optimizing over the space of classifiers. Both the sensitivity method
and objective perturbation result in computationally efficient algorithms for our specific case. In
general, our theoretical bounds on sample requirement are incomparable with the bounds of
Kasiviswanathan et al. (2008) because of the difference between their setting and ours.



Our approach to privacy-preserving tuning uses the exponential mechanism of McSherry and Talwar
(2007) by training classifiers with different parameters on disjoint subsets of the data and then
randomizing the selection of which classifier to release. This bears a superficial resemblance to the
sample-and-aggregate framework (Nissim et al., 2007) and V-fold cross-validation, but only in the sense
that only a part of the data is used to train the classifier. One drawback is that our approach requires
significantly more data in practice. Other approaches to selecting the regularization parameter could
benefit from a more careful analysis of the regularization parameter, as in Hastie et al. (2004).



2 Model 

We will use $\|\mathbf{x}\|$, $\|\mathbf{x}\|_\infty$, and $\|\mathbf{x}\|_{\mathcal{H}}$ to denote the $\ell_2$-norm, $\ell_\infty$-norm, and norm in a Hilbert space $\mathcal{H}$,
respectively. For an integer $n$ we will use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. Vectors will typically
be written in boldface and sets in calligraphic type. For a matrix $A$, we will use the notation $\|A\|_2$
to denote the $L_2$ norm of $A$.



2.1 Empirical Risk Minimization 

In this paper we develop privacy-preserving algorithms for regularized empirical risk minimization, 
a special case of which is learning a classifier from labeled examples. We will phrase our problem in 
terms of classification and indicate when more general results hold. Our algorithms take as input 
training data $\mathcal{D} = \{(\mathbf{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y} : i = 1, 2, \ldots, n\}$ of $n$ data-label pairs. In the case of binary
classification the data space $\mathcal{X} = \mathbb{R}^d$ and the label set $\mathcal{Y} = \{-1, +1\}$. We will assume throughout
that $\mathcal{X}$ is the unit ball so that $\|\mathbf{x}_i\|_2 \le 1$.

We would like to produce a predictor $\mathbf{f} : \mathcal{X} \to \mathcal{Y}$. We measure the quality of our predictor on
the training data via a nonnegative loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.

In regularized empirical risk minimization (ERM), we choose a predictor $\mathbf{f}$ that minimizes the
regularized empirical loss:

$$J(\mathbf{f}, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{f}(\mathbf{x}_i), y_i) + \Lambda N(\mathbf{f}). \tag{1}$$

This minimization is performed over $\mathbf{f}$ in a hypothesis class $\mathcal{H}$. The regularizer $N(\cdot)$ prevents
over-fitting. For the first part of this paper we will restrict our attention to linear predictors and
with some abuse of notation we will write $\mathbf{f}(\mathbf{x}) = \mathbf{f}^T \mathbf{x}$.
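
As an illustration only, the regularized empirical loss (1) for a linear predictor can be sketched in a few lines of Python. The logistic loss, the regularizer $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and the names `logistic_loss` and `erm_objective` are assumptions chosen for this sketch; they are not fixed by the text above.

```python
import math

def logistic_loss(z):
    """Logistic loss log(1 + exp(-z)), evaluated stably for large |z|."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def erm_objective(f, data, lam):
    """J(f, D) = (1/n) * sum_i loss(y_i * f^T x_i) + lam * N(f).

    Assumes a linear predictor and N(f) = ||f||^2 / 2 (illustrative choices).
    """
    n = len(data)
    avg_loss = sum(
        logistic_loss(y * sum(fj * xj for fj, xj in zip(f, x))) for x, y in data
    ) / n
    reg = 0.5 * sum(fj * fj for fj in f)
    return avg_loss + lam * reg

# Two points inside the unit ball with labels in {-1, +1}.
data = [([0.6, 0.0], +1), ([-0.6, 0.0], -1)]
print(erm_objective([0.0, 0.0], data, lam=0.1))  # equals log(2) at f = 0
print(erm_objective([1.0, 0.0], data, lam=0.1))  # smaller: f separates the data
```

The regularization weight `lam` plays the role of $\Lambda$ in (1); larger values pull the minimizer toward the origin.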



2.2 Assumptions on loss and regularizer 

The conditions under which we can prove results on privacy and generalization error depend on 
analytic properties of the loss and regularizer. In particular, we will require certain forms of
convexity (see Rockafellar and Wets (1998)).

Definition 1. A function $H(\mathbf{f})$ over $\mathbf{f} \in \mathbb{R}^d$ is said to be strictly convex if for all $\alpha \in (0, 1)$, $\mathbf{f}$, and
$\mathbf{g}$ with $\mathbf{f} \ne \mathbf{g}$,

$$H(\alpha \mathbf{f} + (1 - \alpha)\mathbf{g}) < \alpha H(\mathbf{f}) + (1 - \alpha) H(\mathbf{g}). \tag{2}$$

It is said to be $\lambda$-strongly convex if for all $\alpha \in (0, 1)$, $\mathbf{f}$, and $\mathbf{g}$,

$$H(\alpha \mathbf{f} + (1 - \alpha)\mathbf{g}) \le \alpha H(\mathbf{f}) + (1 - \alpha) H(\mathbf{g}) - \frac{1}{2} \lambda \alpha (1 - \alpha) \|\mathbf{f} - \mathbf{g}\|_2^2. \tag{3}$$

A strictly convex function has a unique minimum (see Boyd and Vandenberghe (2004)). Strong
convexity plays a role in guaranteeing our privacy and generalization requirements. For our privacy
results to hold we will also require that the regularizer $N(\cdot)$ and loss function $\ell(\cdot, \cdot)$ be differentiable
functions of $\mathbf{f}$. This excludes certain classes of regularizers, such as the $\ell_1$-norm regularizer $N(\mathbf{f}) = \|\mathbf{f}\|_1$,
and classes of loss functions such as the hinge loss $\ell_{SVM}(\mathbf{f}^T \mathbf{x}, y) = (1 - y \mathbf{f}^T \mathbf{x})^+$. In some cases
we can prove privacy guarantees for approximations to these non-differentiable functions.
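
To make the strong convexity condition (3) concrete, the following sketch checks it numerically for one regularizer. The choice $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$ and the helper name `strong_convexity_gap` are assumptions made for illustration; for this particular $N$, inequality (3) holds with equality when $\lambda = 1$ and fails for any larger $\lambda$.

```python
import random

def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(t * t for t in v)

def N(f):
    """Illustrative regularizer N(f) = ||f||^2 / 2 (1-strongly convex)."""
    return 0.5 * sq_norm(f)

def strong_convexity_gap(H, f, g, alpha, lam):
    """Right-hand side minus left-hand side of (3); nonnegative when (3) holds."""
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(f, g)]
    diff = [a - b for a, b in zip(f, g)]
    rhs = (alpha * H(f) + (1 - alpha) * H(g)
           - 0.5 * lam * alpha * (1 - alpha) * sq_norm(diff))
    return rhs - H(mix)

rng = random.Random(0)
gaps = []
for _ in range(1000):
    f = [rng.uniform(-1, 1) for _ in range(3)]
    g = [rng.uniform(-1, 1) for _ in range(3)]
    gaps.append(strong_convexity_gap(N, f, g, rng.uniform(0.01, 0.99), lam=1.0))

# With lam = 1 the gap is zero up to rounding; with lam = 2 it becomes negative.
print(max(abs(t) for t in gaps) < 1e-12)
print(strong_convexity_gap(N, [1.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0.5, lam=2.0) < 0)
```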



2.3 Privacy model 

We are interested in producing a classifier in a manner that preserves the privacy of individual
entries of the dataset $\mathcal{D}$ that is used in training the classifier. The notion of privacy we use is
the $\epsilon_p$-differential privacy model, developed by Dwork et al. (2006b); Dwork (2006), which defines a
notion of privacy for a randomized algorithm $\mathcal{A}(\mathcal{D})$. Suppose $\mathcal{A}(\mathcal{D})$ produces a classifier, and let
$\mathcal{D}'$ be another dataset that differs from $\mathcal{D}$ in one entry (which we assume is the private value of
one person). That is, $\mathcal{D}'$ and $\mathcal{D}$ have $n - 1$ points $(\mathbf{x}_i, y_i)$ in common. The algorithm $\mathcal{A}$ provides
differential privacy if for any set $\mathcal{S}$, the likelihood that $\mathcal{A}(\mathcal{D}) \in \mathcal{S}$ is close to the likelihood that
$\mathcal{A}(\mathcal{D}') \in \mathcal{S}$ (where the likelihood is over the randomness in the algorithm). That is, any single entry of the
dataset does not affect the output distribution of the algorithm by much; dually, this means that
an adversary, who knows all but one entry of the dataset, cannot gain much additional information
about the last entry by observing the output of the algorithm.

The following definition of differential privacy is due to Dwork et al. (2006b), paraphrased
from Wasserman and Zhou (2010).



Definition 2. An algorithm $\mathcal{A}(\mathcal{B})$ taking values in a set $\mathcal{T}$ provides $\epsilon_p$-differential privacy if

$$\sup_{\mathcal{S}} \sup_{\mathcal{D}, \mathcal{D}'} \frac{\mu(\mathcal{S} \mid \mathcal{B} = \mathcal{D})}{\mu(\mathcal{S} \mid \mathcal{B} = \mathcal{D}')} \le e^{\epsilon_p}, \tag{4}$$

where the first supremum is over all measurable $\mathcal{S} \subseteq \mathcal{T}$, the second is over all datasets $\mathcal{D}$ and $\mathcal{D}'$
differing in a single entry, and $\mu(\cdot \mid \mathcal{B})$ is the conditional distribution (measure) on $\mathcal{T}$ induced by
the output $\mathcal{A}(\mathcal{B})$ given a dataset $\mathcal{B}$. The ratio is interpreted to be 1 whenever the numerator and
denominator are both 0.

Figure 1: An algorithm which is differentially private. When datasets which are identical except
for a single entry are input to the algorithm the two distributions on the algorithm's output are
close. For a fixed measurable $\mathcal{S}$ the ratio of the measures (or densities) should be bounded.

Note that if $\mathcal{S}$ is a set of measure 0 under the conditional measures induced by $\mathcal{D}$ and $\mathcal{D}'$, the
ratio is automatically 1. A more measure-theoretic definition is given in Zhou et al. (2009). An
illustration of the definition is given in Figure 1.

The following form of the definition is due to Dwork et al. (2006a).



Definition 3. An algorithm $\mathcal{A}$ provides $\epsilon_p$-differential privacy if for any two datasets $\mathcal{D}$ and $\mathcal{D}'$
that differ in a single entry and for any set $\mathcal{S}$,

$$\exp(-\epsilon_p) \, \mathbb{P}(\mathcal{A}(\mathcal{D}') \in \mathcal{S}) \le \mathbb{P}(\mathcal{A}(\mathcal{D}) \in \mathcal{S}) \le \exp(\epsilon_p) \, \mathbb{P}(\mathcal{A}(\mathcal{D}') \in \mathcal{S}), \tag{5}$$

where $\mathcal{A}(\mathcal{D})$ (resp. $\mathcal{A}(\mathcal{D}')$) is the output of $\mathcal{A}$ on input $\mathcal{D}$ (resp. $\mathcal{D}'$).

We observe that an algorithm $\mathcal{A}$ that satisfies Equation (4) also satisfies Equation (5), and as a
result, Definition 2 is stronger than Definition 3.

From this definition, it is clear that the $\mathcal{A}(\mathcal{D})$ that outputs the minimizer of the ERM objective
(1) does not provide $\epsilon_p$-differential privacy for any $\epsilon_p$. This is because an ERM solution is a linear
combination of some selected training samples "near" the decision boundary. If $\mathcal{D}$ and $\mathcal{D}'$ differ in
one of these samples, then the classifier will change completely, making the likelihood ratio in (5)
infinite. Regularization helps by penalizing the $L_2$ norm of the change, but does not account for how
the direction of the minimizer is sensitive to changes in the data.



Dwork et al. (2006b) also provide a standard recipe for computing privacy-preserving
approximations to functions by adding noise with a particular distribution to the output of the function.
We call this recipe the sensitivity method. Let $g : (\mathbb{R}^m)^n \to \mathbb{R}$ be a scalar function of $z_1, \ldots, z_n$,
where $z_i \in \mathbb{R}^m$ corresponds to the private value of individual $i$; then the sensitivity of $g$ is defined
as follows.

Definition 4. The sensitivity of a function $g : (\mathbb{R}^m)^n \to \mathbb{R}$ is the maximum difference between the
values of the function when one input changes. More formally, the sensitivity $S(g)$ of $g$ is defined
as:

$$S(g) = \max_i \max_{z_1, \ldots, z_n, z_i'} \left| g(z_1, \ldots, z_{i-1}, z_i, z_{i+1}, \ldots, z_n) - g(z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n) \right|. \tag{6}$$

To compute a function $g$ on a dataset $\mathcal{D} = \{z_1, \ldots, z_n\}$, the sensitivity method outputs
$g(z_1, \ldots, z_n) + \eta$, where $\eta$ is a random variable drawn according to the Laplace distribution, with
mean 0 and scale parameter $S(g)/\epsilon_p$. It is shown in Dwork et al. (2006b) that such a procedure is
$\epsilon_p$-differentially private.
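
The sensitivity method can be sketched for a simple scalar function. The example below is an illustration under stated assumptions, not a procedure from the text: the function $g$ is taken to be the mean of values assumed to lie in $[0, 1]$ (so its sensitivity is $1/n$), and the helper names `sample_laplace` and `private_mean` are hypothetical. Laplace noise is drawn as the difference of two independent exponential variables, which has the zero-mean Laplace distribution with the same scale.

```python
import random

def sample_laplace(scale, rng):
    """Zero-mean Laplace draw: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, eps, rng):
    """Sensitivity method for g = mean, assuming every value lies in [0, 1].

    Changing one value moves the mean by at most 1/n, so S(g) = 1/n and the
    Laplace noise has scale S(g) / eps.
    """
    n = len(values)
    sensitivity = 1.0 / n
    return sum(values) / n + sample_laplace(sensitivity / eps, rng)

rng = random.Random(0)
values = [0.2, 0.9, 0.4, 0.7, 0.5]
print(private_mean(values, eps=0.5, rng=rng))  # noisy estimate of the true mean 0.54
```

Smaller values of `eps` give stronger privacy and proportionally larger noise; for a fixed `eps` the noise shrinks as $1/n$.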



3 Privacy-preserving ERM 

Here we describe two approaches for creating privacy-preserving algorithms from (1).

3.1 Output perturbation: the sensitivity method

Algorithm 1 is derived from the sensitivity method of Dwork et al. (2006b), a general method for
generating a privacy-preserving approximation to any function $\mathcal{A}(\cdot)$. In this section the norm $\|\cdot\|$
is the $L_2$-norm unless otherwise specified. For the function $\mathcal{A}(\mathcal{D}) = \arg\min J(\mathbf{f}, \mathcal{D})$, Algorithm 1
outputs a vector $\mathcal{A}(\mathcal{D}) + \mathbf{b}$, where $\mathbf{b}$ is random noise with density

$$\nu(\mathbf{b}) = \frac{1}{\alpha} e^{-\beta \|\mathbf{b}\|}, \tag{7}$$

where $\alpha$ is a normalizing constant. The parameter $\beta$ is a function of $\epsilon_p$ and the $L_2$-sensitivity of
$\mathcal{A}(\cdot)$, which is defined as follows.

Definition 5. The $L_2$-sensitivity of a vector-valued function is defined as the maximum change in
the $L_2$ norm of the value of the function when one input changes. More formally,

$$S(\mathcal{A}) = \max_i \max_{z_1, \ldots, z_n, z_i'} \left\| \mathcal{A}(z_1, \ldots, z_i, \ldots) - \mathcal{A}(z_1, \ldots, z_i', \ldots) \right\|. \tag{8}$$

The interested reader is referred to Dwork et al. (2006b) for further details. Adding noise to
the output of $\mathcal{A}(\cdot)$ has the effect of masking the effect of any particular data point. However, in
some applications the sensitivity of the minimizer $\arg\min J(\mathbf{f}, \mathcal{D})$ may be quite high, which would
require the sensitivity method to add noise with high variance.

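Algorithm 1 ERM with output perturbation (sensitivity)
Inputs: Data $\mathcal{D} = \{z_i\}$, parameters $\epsilon_p$, $\Lambda$.
Output: Approximate minimizer $\mathbf{f}_{priv}$.
Draw a vector $\mathbf{b}$ according to (7) with $\beta = \frac{n \Lambda \epsilon_p}{2}$.
Compute $\mathbf{f}_{priv} = \arg\min J(\mathbf{f}, \mathcal{D}) + \mathbf{b}$.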
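
A minimal Python sketch of Algorithm 1 follows, under assumptions that are not fixed by the text: the loss is the logistic loss, the regularizer is $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, the non-private minimizer is found by plain gradient descent, and the helper names (`sample_noise`, `solve_erm`, `output_perturbation`) are hypothetical. To draw from the density (7), the sketch uses the fact that a spherically symmetric density proportional to $e^{-\beta\|\mathbf{b}\|}$ in $\mathbb{R}^d$ has a norm distributed as Gamma with shape $d$ and scale $1/\beta$, and a direction that is uniform on the sphere.

```python
import math
import random

def sample_noise(d, beta, rng):
    """Draw b in R^d with density proportional to exp(-beta * ||b||)."""
    radius = rng.gammavariate(d, 1.0 / beta)      # ||b|| ~ Gamma(shape=d, scale=1/beta)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]   # normalized Gaussian = uniform direction
    norm = math.sqrt(sum(t * t for t in g))
    return [radius * t / norm for t in g]

def solve_erm(data, lam, steps=2000, lr=0.5):
    """Gradient descent on (1/n) sum_i log(1 + exp(-y_i f^T x_i)) + (lam/2) ||f||^2."""
    n, d = len(data), len(data[0][0])
    f = [0.0] * d
    for _ in range(steps):
        grad = [lam * fj for fj in f]
        for x, y in data:
            z = y * sum(fj * xj for fj, xj in zip(f, x))
            w = -y / (1.0 + math.exp(z)) / n      # y * l'(z) / n for the logistic loss
            for j in range(d):
                grad[j] += w * x[j]
        f = [fj - lr * gj for fj, gj in zip(f, grad)]
    return f

def output_perturbation(data, eps, lam, rng):
    """Algorithm 1 sketch: non-private minimizer plus noise with beta = n * lam * eps / 2."""
    n, d = len(data), len(data[0][0])
    f_star = solve_erm(data, lam)
    b = sample_noise(d, n * lam * eps / 2.0, rng)
    return [fj + bj for fj, bj in zip(f_star, b)]

data = [([0.6, 0.0], 1), ([-0.6, 0.0], -1), ([0.5, 0.2], 1), ([-0.4, -0.1], -1)]
print(output_perturbation(data, eps=1.0, lam=0.1, rng=random.Random(0)))
```

Because $\beta$ grows linearly in $n$, the expected noise norm $d/\beta$ shrinks as the training set grows, which is why output perturbation becomes more accurate with more data.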
3.2 Objective perturbation

A different approach, first proposed by Chaudhuri and Monteleoni (2008), is to add noise to the
objective function itself and then produce the minimizer of the perturbed objective. That is, we
can minimize

$$J_{priv}(\mathbf{f}, \mathcal{D}) = J(\mathbf{f}, \mathcal{D}) + \frac{1}{n} \mathbf{b}^T \mathbf{f}, \tag{9}$$

where $\mathbf{b}$ has density given by (7), with $\beta = \epsilon_p$. Note that the privacy parameter here does not
depend on the sensitivity of the classification algorithm.



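Algorithm 2 ERM with objective perturbation
Inputs: Data $\mathcal{D} = \{z_i\}$, parameters $\epsilon_p$, $\Lambda$, $c$.
Output: Approximate minimizer $\mathbf{f}_{priv}$.
Let $\epsilon_p' = \epsilon_p - \log\left(1 + \frac{2c}{n\Lambda} + \frac{c^2}{n^2\Lambda^2}\right)$.
If $\epsilon_p' > 0$, then $\Delta = 0$, else $\Delta = \frac{c}{n(e^{\epsilon_p/4} - 1)} - \Lambda$, and $\epsilon_p' = \epsilon_p/2$.
Draw a vector $\mathbf{b}$ according to (7) with $\beta = \epsilon_p'/2$.
Compute $\mathbf{f}_{priv} = \arg\min J_{priv}(\mathbf{f}, \mathcal{D}) + \frac{1}{2} \Delta \|\mathbf{f}\|^2$.

The algorithm requires a certain slack, $\log\left(1 + \frac{2c}{n\Lambda} + \frac{c^2}{n^2\Lambda^2}\right)$, in the privacy parameter. This is due
to additional factors in bounding the ratio of the densities. The "If" statement in the algorithm
is from having to consider two cases in the proof of Theorem 2, which shows that the algorithm is
differentially private.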
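
The steps of Algorithm 2 can be sketched as follows. This is an illustrative sketch, not a reference implementation: it assumes the logistic loss (for which $|\ell'| \le 1$ and $|\ell''| \le 1/4$, so $c = 1/4$ is a valid choice), the regularizer $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and plain gradient descent as the optimizer. The function names are hypothetical.

```python
import math
import random

def sample_noise(d, beta, rng):
    """Draw b in R^d with density proportional to exp(-beta * ||b||)."""
    radius = rng.gammavariate(d, 1.0 / beta)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in g))
    return [radius * t / norm for t in g]

def logistic_deriv(z):
    """l'(z) = -1 / (1 + exp(z)) for the logistic loss, evaluated without overflow."""
    if z > 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))

def perturbed_gradient(f, data, lam, delta, b):
    """Gradient of J(f, D) + b^T f / n + (delta / 2) ||f||^2 with N(f) = ||f||^2 / 2."""
    n = len(data)
    grad = [(lam + delta) * fj + bj / n for fj, bj in zip(f, b)]
    for x, y in data:
        z = y * sum(fj * xj for fj, xj in zip(f, x))
        w = y * logistic_deriv(z) / n
        for j in range(len(f)):
            grad[j] += w * x[j]
    return grad

def objective_perturbation(data, eps, lam, c, rng, steps=3000):
    """Algorithm 2 sketch; returns (f_priv, delta, b)."""
    n, d = len(data), len(data[0][0])
    eps_prime = eps - math.log(1.0 + 2.0 * c / (n * lam) + (c / (n * lam)) ** 2)
    if eps_prime > 0:
        delta = 0.0
    else:
        delta = c / (n * (math.exp(eps / 4.0) - 1.0)) - lam
        eps_prime = eps / 2.0
    b = sample_noise(d, eps_prime / 2.0, rng)
    lr = 1.0 / (0.25 + lam + delta)   # 1 / (smoothness bound), assuming ||x_i|| <= 1
    f = [0.0] * d
    for _ in range(steps):
        grad = perturbed_gradient(f, data, lam, delta, b)
        f = [fj - lr * gj for fj, gj in zip(f, grad)]
    return f, delta, b

data = [([0.6, 0.0], 1), ([-0.6, 0.0], -1), ([0.5, 0.2], 1), ([-0.4, -0.1], -1)]
f_priv, delta, b = objective_perturbation(data, eps=1.0, lam=0.1, c=0.25, rng=random.Random(0))
print(f_priv, delta)
```

With only four training points the noise term $\mathbf{b}^T\mathbf{f}/n$ dominates, so the output is far from the non-private minimizer; the example is meant to show the control flow of the algorithm, not realistic accuracy.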



3.3 Privacy guarantees 

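In this section, we establish the conditions under which Algorithms 1 and 2 provide $\epsilon_p$-differential
privacy. First, we establish guarantees for Algorithm 1.

3.3.1 Privacy Guarantees for Output Perturbation

Theorem 1. If $N(\cdot)$ is differentiable and 1-strongly convex, and $\ell$ is convex and differentiable,
with $|\ell'(z)| \le 1$ for all $z$, then Algorithm 1 provides $\epsilon_p$-differential privacy.

The proof of Theorem 1 follows from Corollary 1 and Dwork et al. (2006b). The proof is
provided here for completeness.

Proof. From Corollary 1, if the conditions on $N(\cdot)$ and $\ell$ hold, then the $L_2$-sensitivity of ERM with
regularization parameter $\Lambda$ is at most $\frac{2}{n\Lambda}$. We observe that when we pick $\|\mathbf{b}\|$ from the distribution
in Algorithm 1, for a specific vector $\mathbf{b}_0 \in \mathbb{R}^d$, the density at $\mathbf{b}_0$ is proportional to $e^{-\frac{n \Lambda \epsilon_p}{2} \|\mathbf{b}_0\|}$. Let
$\mathcal{D}$ and $\mathcal{D}'$ be any two datasets that differ in the value of one individual. Then, for any $\mathbf{f}$,

$$\frac{g(\mathbf{f} \mid \mathcal{D})}{g(\mathbf{f} \mid \mathcal{D}')} = \frac{\nu(\mathbf{b}_1)}{\nu(\mathbf{b}_2)} = e^{-\frac{n \Lambda \epsilon_p}{2} (\|\mathbf{b}_1\| - \|\mathbf{b}_2\|)}, \tag{10}$$

where $\mathbf{b}_1$ and $\mathbf{b}_2$ are the corresponding noise vectors chosen in Step 1 of Algorithm 1, and $g(\mathbf{f} \mid \mathcal{D})$
($g(\mathbf{f} \mid \mathcal{D}')$ respectively) is the density of the output of Algorithm 1 at $\mathbf{f}$, when the input is $\mathcal{D}$ ($\mathcal{D}'$
respectively). If $\mathbf{f}_1$ and $\mathbf{f}_2$ are the solutions respectively to non-private regularized ERM when the
input is $\mathcal{D}$ and $\mathcal{D}'$, then $\mathbf{f} = \mathbf{f}_1 + \mathbf{b}_1 = \mathbf{f}_2 + \mathbf{b}_2$, so $\mathbf{b}_1 - \mathbf{b}_2 = \mathbf{f}_2 - \mathbf{f}_1$. From Corollary 1 and using a
triangle inequality,

$$\|\mathbf{b}_1\| - \|\mathbf{b}_2\| \le \|\mathbf{b}_1 - \mathbf{b}_2\| = \|\mathbf{f}_2 - \mathbf{f}_1\| \le \frac{2}{n\Lambda}. \tag{11}$$

Moreover, by symmetry, the density of the directions of $\mathbf{b}_1$ and $\mathbf{b}_2$ are uniform. Therefore, by
construction, $\frac{\nu(\mathbf{b}_1)}{\nu(\mathbf{b}_2)} \le e^{\epsilon_p}$. The theorem follows. □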

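The main ingredient of the proof of Theorem 1 is a result about the sensitivity of regularized
ERM, which is provided below.

Lemma 1. Let $G(\mathbf{f})$ and $g(\mathbf{f})$ be two real-valued functions of a vector $\mathbf{f}$, which are continuous and
differentiable at all points. Moreover, let $G(\mathbf{f})$ and $G(\mathbf{f}) + g(\mathbf{f})$ be $\lambda$-strongly convex. If
$\mathbf{f}_1 = \arg\min_{\mathbf{f}} G(\mathbf{f})$ and $\mathbf{f}_2 = \arg\min_{\mathbf{f}} G(\mathbf{f}) + g(\mathbf{f})$, then

$$\|\mathbf{f}_1 - \mathbf{f}_2\| \le \frac{1}{\lambda} \max_{\mathbf{f}} \|\nabla g(\mathbf{f})\|. \tag{12}$$

Proof. Using the definition of $\mathbf{f}_1$ and $\mathbf{f}_2$, and the fact that $G$ and $g$ are continuous and differentiable
everywhere,

$$\nabla G(\mathbf{f}_1) = \nabla G(\mathbf{f}_2) + \nabla g(\mathbf{f}_2) = 0. \tag{13}$$

As $G(\mathbf{f})$ is $\lambda$-strongly convex, it follows from Lemma 14 of Shalev-Shwartz (2007) that:

$$(\nabla G(\mathbf{f}_1) - \nabla G(\mathbf{f}_2))^T (\mathbf{f}_1 - \mathbf{f}_2) \ge \lambda \|\mathbf{f}_1 - \mathbf{f}_2\|^2. \tag{14}$$

Combining this with (13) and the Cauchy-Schwarz inequality, we get that

$$\|\mathbf{f}_1 - \mathbf{f}_2\| \cdot \|\nabla g(\mathbf{f}_2)\| \ge (\mathbf{f}_1 - \mathbf{f}_2)^T \nabla g(\mathbf{f}_2) = (\nabla G(\mathbf{f}_1) - \nabla G(\mathbf{f}_2))^T (\mathbf{f}_1 - \mathbf{f}_2) \ge \lambda \|\mathbf{f}_1 - \mathbf{f}_2\|^2. \tag{15}$$

The conclusion follows from dividing both sides by $\lambda \|\mathbf{f}_1 - \mathbf{f}_2\|$. □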
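
As a quick sanity check of Lemma 1, consider a worked example that is not part of the original argument: a quadratic $G$ and a linear perturbation $g$, with $\mathbf{a}$ an arbitrary fixed vector introduced only for this illustration. Both $G$ and $G + g$ are $\lambda$-strongly convex, and the bound (12) holds with equality, so the lemma cannot be improved in general.

```latex
% Illustrative example (assumed, not from the source): quadratic G, linear g.
\begin{align*}
G(\mathbf{f}) &= \tfrac{\lambda}{2}\,\|\mathbf{f}\|^2,
  & g(\mathbf{f}) &= \mathbf{a}^T \mathbf{f},
  & \nabla g(\mathbf{f}) &= \mathbf{a} \ \text{for all } \mathbf{f}, \\
\mathbf{f}_1 &= \arg\min_{\mathbf{f}} G(\mathbf{f}) = \mathbf{0},
  & \mathbf{f}_2 &= \arg\min_{\mathbf{f}} \, G(\mathbf{f}) + g(\mathbf{f}) = -\tfrac{1}{\lambda}\,\mathbf{a}, \\
\|\mathbf{f}_1 - \mathbf{f}_2\| &= \tfrac{1}{\lambda}\,\|\mathbf{a}\|
  = \tfrac{1}{\lambda} \max_{\mathbf{f}} \|\nabla g(\mathbf{f})\|.
\end{align*}
```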



Corollary 1. If $N(\cdot)$ is differentiable and 1-strongly convex, and $\ell$ is convex and differentiable with
$|\ell'(z)| \le 1$ for all $z$, then the $L_2$-sensitivity of $\arg\min_{\mathbf{f}} J(\mathbf{f}, \mathcal{D})$ is at most $\frac{2}{n\Lambda}$.

Proof. Let $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and $\mathcal{D}' = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n', y_n')\}$ be two datasets that differ
in the value of the $n$-th individual. Moreover, we let $G(\mathbf{f}) = J(\mathbf{f}, \mathcal{D})$, $g(\mathbf{f}) = J(\mathbf{f}, \mathcal{D}') - J(\mathbf{f}, \mathcal{D})$,
$\mathbf{f}_1 = \arg\min_{\mathbf{f}} J(\mathbf{f}, \mathcal{D})$, and $\mathbf{f}_2 = \arg\min_{\mathbf{f}} J(\mathbf{f}, \mathcal{D}')$. With these definitions,
$g(\mathbf{f}) = \frac{1}{n} \left( \ell(y_n' \mathbf{f}^T \mathbf{x}_n') - \ell(y_n \mathbf{f}^T \mathbf{x}_n) \right)$.

We observe that due to the convexity of $\ell$, and 1-strong convexity of $N(\cdot)$, $G(\mathbf{f}) = J(\mathbf{f}, \mathcal{D})$ is
$\Lambda$-strongly convex. Moreover, $G(\mathbf{f}) + g(\mathbf{f}) = J(\mathbf{f}, \mathcal{D}')$ is also $\Lambda$-strongly convex. Finally, due to the
differentiability of $N(\cdot)$ and $\ell$, $G(\mathbf{f})$ and $g(\mathbf{f})$ are also differentiable at all points. We have:

$$\nabla g(\mathbf{f}) = \frac{1}{n} \left( y_n' \ell'(y_n' \mathbf{f}^T \mathbf{x}_n') \mathbf{x}_n' - y_n \ell'(y_n \mathbf{f}^T \mathbf{x}_n) \mathbf{x}_n \right). \tag{16}$$

As $y_i \in [-1, 1]$, $|\ell'(z)| \le 1$ for all $z$, and $\|\mathbf{x}_i\| \le 1$, for any $\mathbf{f}$,
$\|\nabla g(\mathbf{f})\| \le \frac{1}{n} (\|\mathbf{x}_n\| + \|\mathbf{x}_n'\|) \le \frac{2}{n}$. The proof now follows by an application of Lemma 1
with $\lambda = \Lambda$. □

3.3.2 Privacy Guarantees for Objective Perturbation

In this section, we show that Algorithm 2 is $\epsilon_p$-differentially private. This proof requires stronger
assumptions on the loss function than were required in Theorem 1. In certain cases, some of these
assumptions can be weakened; for such an example, see Section 3.4.2.

Theorem 2. If $N(\cdot)$ is 1-strongly convex and doubly differentiable, and $\ell(\cdot)$ is convex and doubly
differentiable, with $|\ell'(z)| \le 1$ and $|\ell''(z)| \le c$ for all $z$, then Algorithm 2 is $\epsilon_p$-differentially private.



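Proof. Consider an $\mathbf{f}_{priv}$ output by Algorithm 2. We observe that given any fixed $\mathbf{f}_{priv}$ and a fixed
dataset $\mathcal{D}$, there always exists a $\mathbf{b}$ such that Algorithm 2 outputs $\mathbf{f}_{priv}$ on input $\mathcal{D}$. Because $\ell$
is differentiable and convex, and $N(\cdot)$ is differentiable, we can take the gradient of the objective
function and set it to 0 at $\mathbf{f}_{priv}$. Therefore,

$$\mathbf{b} = -n\Lambda \nabla N(\mathbf{f}_{priv}) - \sum_{i=1}^{n} y_i \ell'(y_i \mathbf{f}_{priv}^T \mathbf{x}_i) \mathbf{x}_i - n\Delta \mathbf{f}_{priv}. \tag{17}$$

Note that (17) holds because for any $\mathbf{f}$, $\nabla \ell(\mathbf{f}^T \mathbf{x}) = \ell'(\mathbf{f}^T \mathbf{x}) \mathbf{x}$.

We claim that as $\ell$ is differentiable and $J(\mathbf{f}, \mathcal{D}) + \frac{\Delta}{2} \|\mathbf{f}\|^2$ is strongly convex, given a dataset
$\mathcal{D} = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$, there is a bijection between $\mathbf{b}$ and $\mathbf{f}_{priv}$. The relation (17) shows that
two different $\mathbf{b}$ values cannot result in the same $\mathbf{f}_{priv}$. Furthermore, since the objective is strictly
convex, for a fixed $\mathbf{b}$ and $\mathcal{D}$, there is a unique $\mathbf{f}_{priv}$; therefore the map from $\mathbf{b}$ to $\mathbf{f}_{priv}$ is injective.
The relation (17) also shows that for any $\mathbf{f}_{priv}$ there exists a $\mathbf{b}$ for which $\mathbf{f}_{priv}$ is the minimizer, so
the map from $\mathbf{b}$ to $\mathbf{f}_{priv}$ is surjective.

To show $\epsilon_p$-differential privacy, we need to compute the ratio $g(\mathbf{f}_{priv} \mid \mathcal{D}) / g(\mathbf{f}_{priv} \mid \mathcal{D}')$ of the
densities of $\mathbf{f}_{priv}$ under the two datasets $\mathcal{D}$ and $\mathcal{D}'$. This ratio can be written as (Billingsley, 1995)

$$\frac{g(\mathbf{f}_{priv} \mid \mathcal{D})}{g(\mathbf{f}_{priv} \mid \mathcal{D}')} = \frac{\mu(\mathbf{b} \mid \mathcal{D})}{\mu(\mathbf{b}' \mid \mathcal{D}')} \cdot \frac{|\det(\mathbf{J}(\mathbf{f}_{priv} \to \mathbf{b} \mid \mathcal{D}))|^{-1}}{|\det(\mathbf{J}(\mathbf{f}_{priv} \to \mathbf{b}' \mid \mathcal{D}'))|^{-1}},$$

where $\mathbf{J}(\mathbf{f}_{priv} \to \mathbf{b} \mid \mathcal{D})$ and $\mathbf{J}(\mathbf{f}_{priv} \to \mathbf{b}' \mid \mathcal{D}')$ are the Jacobian matrices of the mappings from $\mathbf{f}_{priv}$ to $\mathbf{b}$,
and $\mu(\mathbf{b} \mid \mathcal{D})$ and $\mu(\mathbf{b}' \mid \mathcal{D}')$ are the densities of $\mathbf{b}$ given the output $\mathbf{f}_{priv}$, when the datasets are $\mathcal{D}$ and
$\mathcal{D}'$ respectively.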

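First, we bound the ratio of the Jacobian determinants. Let $b^{(j)}$ denote the $j$-th coordinate of $\mathbf{b}$. From (17) we have

$$b^{(j)} = -n\Lambda \nabla N(\mathbf{f}_{\mathrm{priv}})^{(j)} - \sum_{i=1}^{n} y_i \ell'(y_i \mathbf{f}_{\mathrm{priv}}^T\mathbf{x}_i)\, x_i^{(j)} - n\Delta f_{\mathrm{priv}}^{(j)}.$$

Given a dataset $\mathcal{D}$, the $(j,k)$-th entry of the Jacobian matrix $J(\mathbf{f} \to \mathbf{b}|\mathcal{D})$ is

$$\frac{\partial b^{(j)}}{\partial f_{\mathrm{priv}}^{(k)}} = -n\Lambda \nabla^2 N(\mathbf{f}_{\mathrm{priv}})^{(j,k)} - \sum_{i} y_i^2 \ell''(y_i \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_i)\, x_i^{(j)} x_i^{(k)} - n\Delta\, 1(j = k),$$

where $1(\cdot)$ is the indicator function. We note that the Jacobian is defined for all $\mathbf{f}_{\mathrm{priv}}$ because $N(\cdot)$ and $\ell$ are globally doubly differentiable.

Let $\mathcal{D}$ and $\mathcal{D}'$ be two datasets which differ in the value of the $n$-th item such that $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}_n, y_n)\}$ and $\mathcal{D}' = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}_n', y_n')\}$. Moreover, we define matrices $A$ and $E$ as follows:

$$A = n\Lambda \nabla^2 N(\mathbf{f}_{\mathrm{priv}}) + \sum_{i=1}^{n} y_i^2 \ell''(y_i \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_i)\mathbf{x}_i \mathbf{x}_i^T + n\Delta I_d,$$

$$E = -y_n^2 \ell''(y_n \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n)\mathbf{x}_n \mathbf{x}_n^T + (y_n')^2 \ell''(y_n' \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n')\mathbf{x}_n' \mathbf{x}_n'^T.$$

Then, $J(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}|\mathcal{D}) = -A$, and $J(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}|\mathcal{D}') = -(A + E)$.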

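Let $\lambda_1(M)$ and $\lambda_2(M)$ denote the largest and second largest eigenvalues of a matrix $M$. As $E$ has rank at most 2, from Lemma 2,

$$\frac{|\det(J(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}'|\mathcal{D}'))|}{|\det(J(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}|\mathcal{D}))|} = \frac{|\det(A + E)|}{|\det A|} = \left|1 + \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E)\right|.$$

For a 1-strongly convex function $N$, the Hessian $\nabla^2 N(\mathbf{f}_{\mathrm{priv}})$ has eigenvalues at least 1 (Boyd and Vandenberghe, 2004). Since we have assumed $\ell$ is doubly differentiable and convex, any eigenvalue of $A$ is therefore at least $n\Lambda + n\Delta$; therefore, for $j = 1, 2$, $|\lambda_j(A^{-1}E)| \le \frac{|\lambda_j(E)|}{n(\Lambda + \Delta)}$. Applying the triangle inequality to the trace norm:

$$|\lambda_1(E)| + |\lambda_2(E)| \le |y_n^2 \ell''(y_n \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n)| \cdot \|\mathbf{x}_n\| + |-(y_n')^2 \ell''(y_n' \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n')| \cdot \|\mathbf{x}_n'\|.$$

Then upper bounds on $|y_i|$, $\|\mathbf{x}_i\|$, and $|\ell''(z)|$ yield

$$|\lambda_1(E)| + |\lambda_2(E)| \le 2c.$$

Therefore, $|\lambda_1(E)| \cdot |\lambda_2(E)| \le c^2$, and

$$\frac{|\det(A + E)|}{|\det(A)|} \le 1 + \frac{2c}{n(\Lambda + \Delta)} + \frac{c^2}{n^2(\Lambda + \Delta)^2} = \left(1 + \frac{c}{n(\Lambda + \Delta)}\right)^2.$$

We now consider two cases. In the first case, $\Delta = 0$, and by definition, in that case, $1 + \frac{2c}{n\Lambda} + \frac{c^2}{n^2\Lambda^2} \le e^{\epsilon_p - \epsilon_p'}$. In the second case, $\Delta > 0$, and in this case, by definition of $\Delta$, $\left(1 + \frac{c}{n(\Lambda + \Delta)}\right)^2 = e^{\epsilon_p/2} = e^{\epsilon_p - \epsilon_p'}$.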
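The two cases can be made concrete with a short numerical sketch. It assumes that Algorithm 2 sets $\epsilon_p' = \epsilon_p - \log\left(1 + \frac{2c}{n\Lambda} + \frac{c^2}{n^2\Lambda^2}\right)$ with $\Delta = 0$ when this quantity is positive, and otherwise sets $\Delta = \frac{c}{n(e^{\epsilon_p/4} - 1)} - \Lambda$ and $\epsilon_p' = \epsilon_p/2$; this rule is inferred from the two cases of the proof, and the function names and parameter values are hypothetical.

```python
import math

def objective_perturbation_params(n, lam, c, eps_p):
    """Return (eps_p_prime, delta) under the assumed rule described in the lead-in."""
    eps_prime = eps_p - math.log(1 + 2 * c / (n * lam) + c**2 / (n * lam) ** 2)
    if eps_prime > 0:
        return eps_prime, 0.0
    delta = c / (n * (math.exp(eps_p / 4) - 1)) - lam
    return eps_p / 2, delta

def jacobian_ratio_bound(n, lam, c, delta):
    """Upper bound (1 + c / (n (lam + delta)))^2 on |det(A + E)| / |det(A)|."""
    return (1 + c / (n * (lam + delta))) ** 2

results = []
for n, lam, c, eps_p in [(1000, 0.01, 0.25, 0.5), (50, 0.001, 0.25, 0.1)]:
    eps_prime, delta = objective_perturbation_params(n, lam, c, eps_p)
    bound = jacobian_ratio_bound(n, lam, c, delta)
    budget = math.exp(eps_p - eps_prime)  # share of the privacy budget left for the Jacobian term
    results.append((eps_prime, delta, bound, budget))
    print(n, eps_prime, delta, bound <= budget * (1 + 1e-9))
```

The first parameter setting falls in the $\Delta = 0$ case and the second in the $\Delta > 0$ case; in both, the determinant ratio stays within $e^{\epsilon_p - \epsilon_p'}$.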

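Next, we bound the ratio of the densities of $\mathbf{b}$. We observe that as $|\ell'(z)| \le 1$ for any $z$ and $|y_i|, \|\mathbf{x}_i\| \le 1$, for datasets $\mathcal{D}$ and $\mathcal{D}'$ which differ by one value,

$$\mathbf{b}' - \mathbf{b} = y_n \ell'(y_n \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n)\mathbf{x}_n - y_n' \ell'(y_n' \mathbf{f}_{\mathrm{priv}}^T \mathbf{x}_n')\mathbf{x}_n'.$$

This implies that:

$$\bigl|\, \|\mathbf{b}\| - \|\mathbf{b}'\| \,\bigr| \le \|\mathbf{b} - \mathbf{b}'\| \le 2.$$

We can write:

$$\frac{\mu(\mathbf{b}|\mathcal{D})}{\mu(\mathbf{b}'|\mathcal{D}')} = \frac{\|\mathbf{b}\|^{d-1} e^{-\epsilon_p' \|\mathbf{b}\|/2} \cdot \frac{1}{\mathrm{surf}(\|\mathbf{b}\|)}}{\|\mathbf{b}'\|^{d-1} e^{-\epsilon_p' \|\mathbf{b}'\|/2} \cdot \frac{1}{\mathrm{surf}(\|\mathbf{b}'\|)}} \le e^{\epsilon_p' \,\bigl|\, \|\mathbf{b}\| - \|\mathbf{b}'\| \,\bigr|/2} \le e^{\epsilon_p'},$$

where $\mathrm{surf}(x)$ denotes the surface area of the sphere in $d$ dimensions with radius $x$. Here the first inequality follows from the fact that $\mathrm{surf}(x) = s(1)x^{d-1}$, where $s(1)$ is the surface area of the unit sphere in $\mathbb{R}^d$.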

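Finally, we are ready to bound the ratio of densities:

$$\frac{g(\mathbf{f}_{\mathrm{priv}}|\mathcal{D})}{g(\mathbf{f}_{\mathrm{priv}}|\mathcal{D}')} = \frac{\mu(\mathbf{b}|\mathcal{D})}{\mu(\mathbf{b}'|\mathcal{D}')} \cdot \frac{|\det(J(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}'|\mathcal{D}'))|}{|\det(J(\mathbf{f}_{\mathrm{priv}} \to \mathbf{b}|\mathcal{D}))|} = \frac{\mu(\mathbf{b}|\mathcal{D})}{\mu(\mathbf{b}'|\mathcal{D}')} \cdot \frac{|\det(A + E)|}{|\det A|} \le e^{\epsilon_p}.$$

Thus, Algorithm 2 satisfies Definition 2. □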
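Lemma 2. If $A$ is full rank, and if $E$ has rank at most 2, then,

$$\frac{\det(A + E) - \det(A)}{\det(A)} = \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E), \tag{18}$$

where $\lambda_j(Z)$ is the $j$-th eigenvalue of matrix $Z$.

Proof. Note that $E$ has rank at most 2, so $A^{-1}E$ also has rank at most 2. Using the fact that $\lambda_i(I + A^{-1}E) = 1 + \lambda_i(A^{-1}E)$,

$$\frac{\det(A + E) - \det(A)}{\det A} = \det(I + A^{-1}E) - 1 = (1 + \lambda_1(A^{-1}E))(1 + \lambda_2(A^{-1}E)) - 1 = \lambda_1(A^{-1}E) + \lambda_2(A^{-1}E) + \lambda_1(A^{-1}E)\lambda_2(A^{-1}E).$$

□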
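Identity (18) is easy to sanity-check on a small example. The sketch below is illustrative only: it uses a hypothetical diagonal $3 \times 3$ matrix $A$ (so that $A^{-1}E$ needs no general matrix inverse) and a rank-2 matrix $E = -\mathbf{u}\mathbf{u}^T + \mathbf{v}\mathbf{v}^T$ with the same shape as the $E$ in the proof of Theorem 2. For a matrix $M$ of rank at most 2, the two nonzero eigenvalues satisfy $\lambda_1 + \lambda_2 = \mathrm{tr}(M)$ and $\lambda_1\lambda_2 = \frac{1}{2}(\mathrm{tr}(M)^2 - \mathrm{tr}(M^2))$, which avoids an explicit eigenvalue solver.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def matmul(p, q):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(p[r][k] * q[k][s] for k in range(3)) for s in range(3)] for r in range(3)]

def trace(m):
    return sum(m[k][k] for k in range(3))

# Hypothetical full-rank diagonal A and rank-2 perturbation E = -u u^T + v v^T.
diag = [2.0, 3.0, 5.0]
A = [[diag[r] if r == s else 0.0 for s in range(3)] for r in range(3)]
u = [0.6, -0.2, 0.3]
v = [0.1, 0.5, -0.4]
E = [[-u[r] * u[s] + v[r] * v[s] for s in range(3)] for r in range(3)]

M = [[E[r][s] / diag[r] for s in range(3)] for r in range(3)]  # A^{-1} E for diagonal A
A_plus_E = [[A[r][s] + E[r][s] for s in range(3)] for r in range(3)]

lhs = (det3(A_plus_E) - det3(A)) / det3(A)
sum_eigs = trace(M)                                            # lambda_1 + lambda_2
prod_eigs = 0.5 * (trace(M) ** 2 - trace(matmul(M, M)))        # lambda_1 * lambda_2
rhs = sum_eigs + prod_eigs
print(lhs, rhs)
```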

3.4 Application to classification 

In this section, we show how to use our results to provide privacy-preserving versions of logistic 
regression and support vector machines. 

3.4.1 Logistic Regression 

One popular ERM classification algorithm is regularized logistic regression. In this case, $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and the loss function is $\ell_{\mathrm{LR}}(z) = \log(1 + e^{-z})$. Taking derivatives and double derivatives,

$$\ell_{\mathrm{LR}}'(z) = \frac{-1}{1 + e^{z}}, \qquad \ell_{\mathrm{LR}}''(z) = \frac{1}{(1 + e^{-z})(1 + e^{z})}.$$

Note that $\ell_{\mathrm{LR}}$ is continuous, differentiable and doubly differentiable, with $c \le \frac{1}{4}$. Therefore, we can plug in logistic loss directly to Theorems 1 and 2 to get the following result.

Corollary 2. The output of Algorithm 1 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, $\ell = \ell_{\mathrm{LR}}$ is an $\epsilon_p$-differentially private approximation to logistic regression. The output of Algorithm 2 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, $c = \frac{1}{4}$, and $\ell = \ell_{\mathrm{LR}}$, is an $\epsilon_p$-differentially private approximation to logistic regression.

We quantify how well the outputs of Algorithms 1 and 2 approximate (non-private) logistic regression in Section 4.
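The derivative bounds used for the logistic loss can be checked on a grid. This is a minimal sketch with hypothetical function names; it evaluates the closed-form first and second derivatives of $\log(1 + e^{-z})$, compares the first derivative with a central finite difference, and reports the largest values found.

```python
import math

def logistic_loss(z):
    return math.log1p(math.exp(-z))

def logistic_d1(z):
    """First derivative of the logistic loss."""
    return -1.0 / (1.0 + math.exp(z))

def logistic_d2(z):
    """Second derivative of the logistic loss."""
    return 1.0 / ((1.0 + math.exp(-z)) * (1.0 + math.exp(z)))

grid = [k / 100.0 for k in range(-1000, 1001)]  # z in [-10, 10]
max_abs_d1 = max(abs(logistic_d1(z)) for z in grid)
max_d2 = max(logistic_d2(z) for z in grid)

# Central finite difference as an independent check of the first derivative.
step = 1e-5
fd_error = max(
    abs((logistic_loss(z + step) - logistic_loss(z - step)) / (2 * step) - logistic_d1(z))
    for z in grid
)
print(max_abs_d1, max_d2, fd_error)
```

The second derivative peaks at $z = 0$, which is where the constant $c = \frac{1}{4}$ comes from.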



3.4.2 Support Vector Machines 



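Another very commonly used classifier is the $L_2$-regularized support vector machine. In this case, again, $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and

$$\ell_{\mathrm{SVM}}(z) = \max(0, 1 - z). \tag{19}$$

Notice that this loss function is continuous, but not differentiable, and thus it does not satisfy conditions in Theorems 1 and 2.

There are two alternative solutions to this. First, we can approximate $\ell_{\mathrm{SVM}}$ by a different loss function, which is doubly differentiable, as follows (see also Chapelle (2007)):

$$\ell_s(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ -\frac{(1 - z)^4}{16h^3} + \frac{3(1 - z)^2}{8h} + \frac{1 - z}{2} + \frac{3h}{16} & \text{if } |1 - z| \le h \\ 1 - z & \text{if } z < 1 - h \end{cases} \tag{20}$$

As $h \to 0$, this loss approaches the hinge loss. Taking derivatives, we observe that:

$$\ell_s'(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ \frac{(1 - z)^3}{4h^3} - \frac{3(1 - z)}{4h} - \frac{1}{2} & \text{if } |1 - z| \le h \\ -1 & \text{if } z < 1 - h \end{cases} \tag{21}$$

Moreover,

$$\ell_s''(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ -\frac{3(1 - z)^2}{4h^3} + \frac{3}{4h} & \text{if } |1 - z| \le h \\ 0 & \text{if } z < 1 - h \end{cases} \tag{22}$$

Observe that this implies that $|\ell_s''(z)| \le \frac{3}{4h}$ for all $h$ and $z$. Moreover, $\ell_s$ is convex, as $\ell_s''(z) \ge 0$ for all $z$. Therefore, $\ell_s$ can be used in Theorems 1 and 2, which gives us privacy-preserving approximations to regularized support vector machines.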
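The properties claimed for the smoothed hinge loss can be verified numerically. The sketch below implements (20), (21) and (22) directly; the function names, the grid, and the choice $h = 0.5$ are hypothetical. It reports the largest second derivative (to compare with $\frac{3}{4h}$), the smallest second derivative (convexity), the largest magnitude of the first derivative, and the largest gap to the hinge loss.

```python
def smooth_hinge(z, h):
    """Doubly differentiable approximation (20) to the hinge loss."""
    u = 1.0 - z
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return u
    return -u**4 / (16 * h**3) + 3 * u**2 / (8 * h) + u / 2 + 3 * h / 16

def smooth_hinge_d1(z, h):
    """First derivative (21)."""
    u = 1.0 - z
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return -1.0
    return u**3 / (4 * h**3) - 3 * u / (4 * h) - 0.5

def smooth_hinge_d2(z, h):
    """Second derivative (22)."""
    u = 1.0 - z
    if abs(u) > h:
        return 0.0
    return -3 * u**2 / (4 * h**3) + 3 / (4 * h)

h = 0.5
grid = [k / 1000.0 for k in range(-2000, 4001)]  # z in [-2, 4]
c_bound = 3 / (4 * h)
max_d2 = max(smooth_hinge_d2(z, h) for z in grid)
min_d2 = min(smooth_hinge_d2(z, h) for z in grid)
max_abs_d1 = max(abs(smooth_hinge_d1(z, h)) for z in grid)
hinge_gap = max(abs(smooth_hinge(z, h) - max(0.0, 1.0 - z)) for z in grid)
print(max_d2, c_bound, min_d2, max_abs_d1, hinge_gap)
```

The largest gap to the hinge loss occurs at $z = 1$ and equals $\frac{3h}{16}$, which shrinks as $h \to 0$ while the curvature bound $\frac{3}{4h}$ grows; this is the tradeoff that the choice of $h$ controls.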



Corollary 3. The output of Algorithm 1 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and $\ell = \ell_s$ is an $\epsilon_p$-differentially private approximation to support vector machines. The output of Algorithm 2 with $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, $c = \frac{3}{4h}$, and $\ell = \ell_s$ is an $\epsilon_p$-differentially private approximation to support vector machines.



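The second solution is to use Huber loss, as suggested by Chapelle (2007), which is defined as follows:

$$\ell_{\mathrm{Huber}}(z) = \begin{cases} 0 & \text{if } z > 1 + h \\ \frac{1}{4h}(1 + h - z)^2 & \text{if } |1 - z| \le h \\ 1 - z & \text{if } z < 1 - h \end{cases} \tag{23}$$

Observe that Huber loss is convex and differentiable, and piecewise doubly differentiable, with $c = \frac{1}{2h}$. However, it is not globally doubly differentiable, and hence the Jacobian in the proof of Theorem 2 is undefined for certain values of $\mathbf{f}$. However, we can show that in this case, Algorithm 2, when run with $c = \frac{1}{2h}$, satisfies Definition 3.

Let $G$ denote the map from $\mathbf{f}_{\mathrm{priv}}$ to $\mathbf{b}$ in (17) under $\mathcal{B} = \mathcal{D}$, and $H$ denote the map under $\mathcal{B} = \mathcal{D}'$. By definition, the probability $P(\mathbf{f}_{\mathrm{priv}} \in S \mid \mathcal{B} = \mathcal{D}) = P_{\mathbf{b}}(\mathbf{b} \in G(S))$.

Corollary 4. Let $\mathbf{f}_{\mathrm{priv}}$ be the output of Algorithm 2 with $\ell = \ell_{\mathrm{Huber}}$, $c = \frac{1}{2h}$, and $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|_2^2$. For any set $S$ of possible values of $\mathbf{f}_{\mathrm{priv}}$, and any pair of datasets $\mathcal{D}$, $\mathcal{D}'$ which differ in the private value of one person $(\mathbf{x}_n, y_n)$,

$$e^{-\epsilon_p} P(S \mid \mathcal{B} = \mathcal{D}') \le P(S \mid \mathcal{B} = \mathcal{D}) \le e^{\epsilon_p} P(S \mid \mathcal{B} = \mathcal{D}'). \tag{24}$$

Proof. Consider the event $\mathbf{f}_{\mathrm{priv}} \in S$. Let $\mathcal{T} = G(S)$ and $\mathcal{T}' = H(S)$. Because $G$ is a bijection, we have

$$P(\mathbf{f}_{\mathrm{priv}} \in S \mid \mathcal{B} = \mathcal{D}) = P_{\mathbf{b}}(\mathbf{b} \in \mathcal{T} \mid \mathcal{B} = \mathcal{D}), \tag{25}$$

and a similar expression when $\mathcal{B} = \mathcal{D}'$.

Now note that $\ell_{\mathrm{Huber}}'(z)$ is only non-differentiable for a finite number of values of $z$. Let $\mathcal{Z}$ be the set of these values of $z$, and define

$$C = \{\mathbf{f} : y\mathbf{f}^T\mathbf{x} = z \in \mathcal{Z},\ (\mathbf{x}, y) \in \mathcal{D} \cup \mathcal{D}'\}. \tag{26}$$

Pick a tuple $(z, (\mathbf{x}, y)) \in \mathcal{Z} \times (\mathcal{D} \cup \mathcal{D}')$. The set of $\mathbf{f}$ such that $y\mathbf{f}^T\mathbf{x} = z$ is a hyperplane in $\mathbb{R}^d$. Since $\nabla N(\mathbf{f}) = \mathbf{f}$ and $\ell'$ is piecewise linear, from (17) we see that the set of corresponding $\mathbf{b}$'s is also piecewise linear, and hence has Lebesgue measure 0. Since the measure corresponding to $\mathbf{b}$ is absolutely continuous with respect to the Lebesgue measure, this hyperplane has probability 0 under $\mathbf{b}$ as well. Since $C$ is a finite union of such hyperplanes, we have $P(\mathbf{b} \in G(C)) = 0$.

Thus we have $P_{\mathbf{b}}(\mathcal{T} \mid \mathcal{B} = \mathcal{D}) = P_{\mathbf{b}}(G(S \setminus C) \mid \mathcal{B} = \mathcal{D})$, and similarly for $\mathcal{D}'$. From the definition of $G$ and $H$, for $\mathbf{f} \in S \setminus C$,

$$H(\mathbf{f}) = G(\mathbf{f}) + y_n \ell'(y_n \mathbf{f}^T \mathbf{x}_n)\mathbf{x}_n - y_n' \ell'(y_n' \mathbf{f}^T \mathbf{x}_n')\mathbf{x}_n'. \tag{27}$$

Since $\mathbf{f} \notin C$, this mapping shows that if $P_{\mathbf{b}}(G(S \setminus C) \mid \mathcal{B} = \mathcal{D}) = 0$ then we must have $P_{\mathbf{b}}(H(S \setminus C) \mid \mathcal{B} = \mathcal{D}') = 0$. Thus the result holds for sets of measure 0. If $S \setminus C$ has positive measure we can calculate the ratio of the probabilities for $\mathbf{f}_{\mathrm{priv}}$ for which the loss is twice differentiable. For such $\mathbf{f}_{\mathrm{priv}}$ the Jacobian is also defined, and we can use a method similar to Theorem 2 to prove the result. □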
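The set $\mathcal{Z}$ of exceptional points in this proof is small and explicit for the Huber loss (23): the second derivative jumps only at $z = 1 - h$ and $z = 1 + h$. The following sketch, with hypothetical function names and $h = 0.5$, implements the loss and its derivatives and shows that the first derivative is continuous at both kinks while the second derivative is not.

```python
def huber_hinge(z, h):
    """Huber approximation (23) to the hinge loss."""
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return 1.0 - z
    return (1.0 + h - z) ** 2 / (4 * h)

def huber_hinge_d1(z, h):
    """First derivative; piecewise linear and continuous."""
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return -1.0
    return -(1.0 + h - z) / (2 * h)

def huber_hinge_d2(z, h):
    """Second derivative away from the kinks; it is undefined at z = 1 - h and z = 1 + h."""
    return 1.0 / (2 * h) if abs(1.0 - z) < h else 0.0

h = 0.5
kinks = [1.0 - h, 1.0 + h]  # the finite set of non-differentiable points of the derivative
step = 1e-6

# The second derivative takes different values on the two sides of each kink ...
jumps = [abs(huber_hinge_d2(k + step, h) - huber_hinge_d2(k - step, h)) for k in kinks]
# ... while the first derivative changes only by an amount proportional to the step.
d1_gaps = [abs(huber_hinge_d1(k + step, h) - huber_hinge_d1(k - step, h)) for k in kinks]
print(jumps, d1_gaps)
```

Each kink contributes one hyperplane $y\mathbf{f}^T\mathbf{x} = z$ per data point to the set $C$, which is why $C$ is a finite union of measure-zero sets.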

Remark: Because the privacy proof for Algorithm 1 does not require the analytic properties of Theorem 2, we can also use Huber loss in Algorithm 1 to get an $\epsilon_p$-differentially private approximation to the SVM. We quantify how well the outputs of Algorithms 1 and 2 approximate private support vector machines in Section 4. These approximations to the hinge loss are necessary because of the analytic requirements of Theorems 1 and 2 on the loss function. Because the requirements of Theorem 2 are stricter, it may be possible to use an approximate loss in Algorithm 1 that would not be admissible in Algorithm 2.

4 Generalization performance

In this section, we provide guarantees on the performance of the privacy-preserving ERM algorithms in Section 3. We provide these bounds for $L_2$-regularization. To quantify this performance, we will assume that the $n$ entries in the dataset $\mathcal{D}$ are drawn i.i.d. according to a fixed distribution $P(\mathbf{x}, y)$. We measure the performance of these algorithms by the number of samples $n$ required to achieve error $L^* + \epsilon_g$, where $L^*$ is the loss of a reference ERM predictor $\mathbf{f}_0$. This resulting bound on $\epsilon_g$ will depend on the norm $\|\mathbf{f}_0\|$ of this predictor. By choosing an upper bound $\nu$ on the norm, we can interpret the result as saying that the privacy-preserving classifier will have error at most $\epsilon_g$ more than that of any predictor with $\|\mathbf{f}_0\| \le \nu$.

Given a distribution $P$ the expected loss $L(\mathbf{f})$ for a classifier $\mathbf{f}$ is

$$L(\mathbf{f}) = \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[\ell(\mathbf{f}^T\mathbf{x}, y)\right]. \tag{28}$$

The sample complexity for generalization error $\epsilon_g$ against a classifier $\mathbf{f}_0$ is the number of samples $n$ required to achieve error $L(\mathbf{f}_0) + \epsilon_g$ under any data distribution $P$. We would like the sample complexity to be low.

For a fixed $P$ we define the following function, which will be useful in our analysis:

$$J(\mathbf{f}) = L(\mathbf{f}) + \frac{\Lambda}{2}\|\mathbf{f}\|^2. \tag{29}$$

The function $J(\mathbf{f})$ is the expectation (over $P$) of the non-private $L_2$-regularized ERM objective evaluated at $\mathbf{f}$.

For non-private ERM, Shalev-Shwartz and Srebro (2008) show that for a given $\mathbf{f}_0$ with loss $L(\mathbf{f}_0) = L^*$, if the number of data points satisfies

$$n > C\,\frac{\|\mathbf{f}_0\|^2 \log(\frac{1}{\delta})}{\epsilon_g^2} \tag{30}$$

for some constant $C$, then the excess loss of the $L_2$-regularized SVM solution $\mathbf{f}_{\mathrm{svm}}$ satisfies $L(\mathbf{f}_{\mathrm{svm}}) \le L(\mathbf{f}_0) + \epsilon_g$. This order growth will hold for our results as well. It also serves as a reference against which we can compare the additional burden on the sample complexity imposed by the privacy constraints.

For most learning problems, we require the generalization error $\epsilon_g < 1$. Moreover, it is also typically the case that for more difficult learning problems, $\|\mathbf{f}_0\|$ is higher. For example, for regularized SVM, $\frac{1}{\|\mathbf{f}_0\|}$ is the margin of classification, and as a result, $\|\mathbf{f}_0\|$ is higher for learning problems with smaller margin. From the bounds provided in this section, we note that the dominating term in the sample requirement for objective perturbation has a better dependence on $\|\mathbf{f}_0\|$ as well as $\frac{1}{\epsilon_g}$; as a result, for more difficult learning problems, we expect objective perturbation to perform better than output perturbation.



4.1 Output perturbation 

First, we provide performance guarantees for Algorithm 1, by providing a bound on the number of samples required for Algorithm 1 to produce a classifier with low error.

Definition 6. A function $g(z) : \mathbb{R} \to \mathbb{R}$ is $c$-Lipschitz if for all pairs $(z_1, z_2)$ we have $|g(z_1) - g(z_2)| \le c|z_1 - z_2|$.

Recall that if a function $g(z)$ is differentiable, with $|g'(z)| \le r$ for all $z$, then $g(z)$ is also $r$-Lipschitz.

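Theorem 3. Let $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and let $\mathbf{f}_0$ be a classifier such that $L(\mathbf{f}_0) = L^*$, and let $\delta > 0$. If $\ell$ is differentiable and continuous with $|\ell'(z)| \le 1$, the derivative $\ell'$ is $c$-Lipschitz, and the data $\mathcal{D}$ is drawn i.i.d. according to $P$, then there exists a constant $C$ such that if the number of training samples satisfies

$$n > C \max\left(\frac{\|\mathbf{f}_0\|^2 \log(\frac{1}{\delta})}{\epsilon_g^2},\ \frac{d \log(\frac{d}{\delta})\|\mathbf{f}_0\|}{\epsilon_g \epsilon_p},\ \frac{d \log(\frac{d}{\delta})\, c^{1/2}\, \|\mathbf{f}_0\|^2}{\epsilon_g^{3/2} \epsilon_p}\right), \tag{31}$$

where $d$ is the dimension of the data space, then the output $\mathbf{f}_{\mathrm{priv}}$ of Algorithm 1 satisfies

$$P\left(L(\mathbf{f}_{\mathrm{priv}}) \le L^* + \epsilon_g\right) \ge 1 - 2\delta. \tag{32}$$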

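Proof. Let

$$\mathbf{f}_{\mathrm{rtr}} = \operatorname*{argmin}_{\mathbf{f}} J(\mathbf{f}), \qquad \mathbf{f}^* = \operatorname*{argmin}_{\mathbf{f}} J(\mathbf{f}, \mathcal{D}),$$

and let $\mathbf{f}_{\mathrm{priv}}$ denote the output of Algorithm 1. Here $J(\mathbf{f})$ with a single argument is the expected objective (29), and $J(\mathbf{f}, \mathcal{D})$ is its empirical counterpart. Using the analysis method of Shalev-Shwartz and Srebro (2008) shows

$$L(\mathbf{f}_{\mathrm{priv}}) = L(\mathbf{f}_0) + \left(J(\mathbf{f}_{\mathrm{priv}}) - J(\mathbf{f}_{\mathrm{rtr}})\right) + \left(J(\mathbf{f}_{\mathrm{rtr}}) - J(\mathbf{f}_0)\right) + \frac{\Lambda}{2}\|\mathbf{f}_0\|^2 - \frac{\Lambda}{2}\|\mathbf{f}_{\mathrm{priv}}\|^2. \tag{33}$$

We will bound the terms on the right-hand side of (33).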

For a regularizer iV(f) = ^||f|p the Hessian satisfies ||V^A^(f)||2 < 1 • Therefore, from Lemma 
[3l with probability 1 — 6 over the privacy mechanism, 

Furthermore, the results of Sridharan et alJ ( 20081 ) show that with probability 1 — 6 over the choice 
of the data distribution. 



J(fpriv) - J(frtr) < 2(j(fpriv,i^) - J{r,v)) + o 



log(l/'^) 
An 



The constant in the last term depends on the derivative of the loss and the bound on the data points, 
which by assumption are bounded. Combining the preceeding two statements, with probability 
1 — 26 over the noise in the privacy mechanism and the data distribution, the second term in the 
right-hand-side of ([33]) is at most: 

i(W-J(f„).i^^Mi^ + 0(M). (34) 

By definition of frtr, the difference (J(frtr) — <^(fo)) < 0. Setting A = jj^^ in (j33]l and using (p4|l . 
we obtain 

L(fp„,) < m) + ie||fol|Vlog'W^)(. + .,/||f„||-) ^ ^ (||,„||.!2iM)) + (35) 

n^e^e^ V neg J 2 

Solving for $n$ to make the total excess error equal to $\epsilon_g$ yields (31). □

Lemma 3. Suppose $N(\cdot)$ is doubly differentiable with $\|\nabla^2 N(\mathbf{f})\|_2 \le \eta$ for all $\mathbf{f}$, and suppose that $\ell$ is differentiable and has continuous and $c$-Lipschitz derivatives. Given training data $\mathcal{D}$, let $\mathbf{f}^*$ be a classifier that minimizes $J(\mathbf{f}, \mathcal{D})$ and let $\mathbf{f}_{\mathrm{priv}}$ be the classifier output by Algorithm 1. Then

J(fpriv,P) < J(f ,P) H A%2^2 j>l-o, (36) 

where the probability is taken over the randomness in the noise $\mathbf{b}$ of Algorithm 1.

Note that when $\ell$ is doubly differentiable, $c$ is an upper bound on the double derivative of $\ell$, and is the same as the constant $c$ in Theorem 2.

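Proof. Let $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, and recall that $\|\mathbf{x}_i\| \le 1$, and $|y_i| \le 1$. As $N(\cdot)$ and $\ell$ are differentiable, we use the Mean Value Theorem to show that for some $t$ between 0 and 1,

$$J(\mathbf{f}_{\mathrm{priv}}, \mathcal{D}) - J(\mathbf{f}^*, \mathcal{D}) = (\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^*)^T \nabla J(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}}) \le \|\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^*\| \cdot \|\nabla J(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}})\|, \tag{37}$$

where the second step follows by an application of the Cauchy-Schwarz inequality. Recall that

$$\nabla J(\mathbf{f}, \mathcal{D}) = \Lambda \nabla N(\mathbf{f}) + \frac{1}{n}\sum_i y_i \ell'(y_i \mathbf{f}^T\mathbf{x}_i)\mathbf{x}_i.$$

Moreover, recall that $\nabla J(\mathbf{f}^*, \mathcal{D}) = 0$, from the optimality of $\mathbf{f}^*$. Therefore,

$$\nabla J(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}}, \mathcal{D}) = \nabla J(\mathbf{f}^*, \mathcal{D}) - \Lambda\left(\nabla N(\mathbf{f}^*) - \nabla N(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}})\right) - \frac{1}{n}\sum_i y_i\left(\ell'(y_i(\mathbf{f}^*)^T\mathbf{x}_i) - \ell'(y_i(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}})^T\mathbf{x}_i)\right)\mathbf{x}_i. \tag{38}$$

Now, from the Lipschitz condition on $\ell'$, for each $i$ we can upper bound each term in the summation above:

$$\begin{aligned} \left\|y_i\left(\ell'(y_i(\mathbf{f}^*)^T\mathbf{x}_i) - \ell'(y_i(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}})^T\mathbf{x}_i)\right)\mathbf{x}_i\right\| &\le |y_i| \cdot \|\mathbf{x}_i\| \cdot \left|\ell'(y_i(\mathbf{f}^*)^T\mathbf{x}_i) - \ell'(y_i(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}})^T\mathbf{x}_i)\right| \\ &\le |y_i| \cdot \|\mathbf{x}_i\| \cdot c \cdot \left|y_i(1 - t)(\mathbf{f}^* - \mathbf{f}_{\mathrm{priv}})^T\mathbf{x}_i\right| \\ &\le c(1 - t)|y_i|^2 \cdot \|\mathbf{x}_i\|^2 \cdot \|\mathbf{f}^* - \mathbf{f}_{\mathrm{priv}}\| \\ &\le c(1 - t)\|\mathbf{f}^* - \mathbf{f}_{\mathrm{priv}}\|. \end{aligned} \tag{39}$$

The second step follows because $\ell'$ is $c$-Lipschitz and the last step follows from the bounds on $|y_i|$ and $\|\mathbf{x}_i\|$. Because $N$ is doubly differentiable, we can apply the Mean Value Theorem again to conclude that

$$\|\nabla N(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}}) - \nabla N(\mathbf{f}^*)\| \le (1 - t)\|\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^*\| \cdot \|\nabla^2 N(\mathbf{f}'')\|_2 \tag{40}$$

for some $\mathbf{f}'' \in \mathbb{R}^d$.

As $0 \le t \le 1$, we can combine (39) and (40) to obtain

$$\begin{aligned} \|\nabla J(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}}, \mathcal{D})\| &\le \left\|\Lambda\left(\nabla N(\mathbf{f}^*) - \nabla N(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}})\right)\right\| + \left\|\frac{1}{n}\sum_i y_i\left(\ell'(y_i(\mathbf{f}^*)^T\mathbf{x}_i) - \ell'(y_i(t\mathbf{f}^* + (1 - t)\mathbf{f}_{\mathrm{priv}})^T\mathbf{x}_i)\right)\mathbf{x}_i\right\| \\ &\le (1 - t)\|\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^*\| \cdot \left(\Lambda\eta + \frac{1}{n} \cdot n \cdot c\right) \\ &\le \|\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^*\|(\Lambda\eta + c). \end{aligned} \tag{41}$$

From the definition of Algorithm 1, $\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^* = \mathbf{b}$, where $\mathbf{b}$ is the noise vector. Now we can apply Lemma 4 to $\|\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^*\|$, with parameters $k = d$, and $\theta = \frac{2}{\Lambda n \epsilon_p}$. From Lemma 4, with probability $1 - \delta$, $\|\mathbf{f}_{\mathrm{priv}} - \mathbf{f}^*\| \le \frac{2d\log(\frac{d}{\delta})}{\Lambda n \epsilon_p}$. The Lemma follows by combining this with Equations (41) and (37).

□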

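Lemma 4. Let $X$ be a random variable drawn from the distribution $\Gamma(k, \theta)$, where $k$ is an integer. Then,

$$P\left(X < k\theta \log\left(\frac{k}{\delta}\right)\right) \ge 1 - \delta. \tag{42}$$

Proof. Since $k$ is an integer, we can decompose $X$ distributed according to $\Gamma(k, \theta)$ as a summation

$$X = X_1 + \ldots + X_k, \tag{43}$$

where $X_1, X_2, \ldots, X_k$ are independent exponential random variables with mean $\theta$. For each $i$ we have $P(X_i \ge \theta\log(k/\delta)) = \delta/k$. Now,

$$\begin{aligned} P(X < k\theta\log(k/\delta)) &\ge P(X_i < \theta\log(k/\delta),\ i = 1, 2, \ldots, k) && (44) \\ &= (1 - \delta/k)^k && (45) \\ &\ge 1 - \delta. && (46) \end{aligned}$$

□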
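The tail bound of Lemma 4 can be checked by simulation. The sketch below is illustrative: the parameter values and the helper name `gamma_tail_threshold` are hypothetical, and it relies on `random.gammavariate(alpha, beta)` from the Python standard library, which takes a shape parameter `alpha` and a scale parameter `beta`.

```python
import math
import random

def gamma_tail_threshold(k, theta, delta):
    """Threshold k * theta * log(k / delta) from Lemma 4."""
    return k * theta * math.log(k / delta)

random.seed(1)
k, theta, delta = 5, 2.0, 0.05
threshold = gamma_tail_threshold(k, theta, delta)

trials = 20000
hits = sum(random.gammavariate(k, theta) < threshold for _ in range(trials))
coverage = hits / trials  # empirical estimate of P(X < threshold)
print(threshold, coverage)
```

In this setting the empirical coverage is far above $1 - \delta$, which reflects that the union-bound argument in the proof is conservative.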

4.2 Objective perturbation 

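We now establish performance bounds on Algorithm 2. The bound can be summarized as follows.

Theorem 4. Let $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|^2$, and let $\mathbf{f}_0$ be a classifier with expected loss $L(\mathbf{f}_0) = L^*$. Let $\ell$ be convex, doubly differentiable, and let its derivatives satisfy $|\ell'(z)| \le 1$ and $|\ell''(z)| \le c$ for all $z$. Then there exists a constant $C$ such that for $\delta > 0$, if the $n$ training samples in $\mathcal{D}$ are drawn i.i.d. according to $P$, and if

$$n > C \max\left(\frac{\|\mathbf{f}_0\|^2 \log(\frac{1}{\delta})}{\epsilon_g^2},\ \frac{c\,\|\mathbf{f}_0\|^2}{\epsilon_g \epsilon_p},\ \frac{d\log(\frac{d}{\delta})\|\mathbf{f}_0\|}{\epsilon_g \epsilon_p}\right), \tag{47}$$

then the output $\mathbf{f}_{\mathrm{priv}}$ of Algorithm 2 satisfies

$$P\left(L(\mathbf{f}_{\mathrm{priv}}) \le L^* + \epsilon_g\right) \ge 1 - 2\delta. \tag{48}$$

Proof. Let

$$\mathbf{f}_{\mathrm{rtr}} = \operatorname*{argmin}_{\mathbf{f}} J(\mathbf{f}), \qquad \mathbf{f}^* = \operatorname*{argmin}_{\mathbf{f}} J(\mathbf{f}, \mathcal{D}),$$

and let $\mathbf{f}_{\mathrm{priv}}$ denote the output of Algorithm 2. As in Theorem 3, the analysis of Shalev-Shwartz and Srebro (2008) shows

$$L(\mathbf{f}_{\mathrm{priv}}) = L(\mathbf{f}_0) + \left(J(\mathbf{f}_{\mathrm{priv}}) - J(\mathbf{f}_{\mathrm{rtr}})\right) + \left(J(\mathbf{f}_{\mathrm{rtr}}) - J(\mathbf{f}_0)\right) + \frac{\Lambda}{2}\|\mathbf{f}_0\|^2 - \frac{\Lambda}{2}\|\mathbf{f}_{\mathrm{priv}}\|^2. \tag{49}$$

We will bound each of the terms on the right-hand side.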

If n > and A > .nl^a-) , then nA > -7^ , so from the definition of e' in Algorithm [2l 

e'p = ep- 21og (1 + = - 21og (l + |) > - |, (50) 

where the last step follows because log(l + x) < x ior x £ [0, 1]. Note that for these values of A we 
have e'p > 0. 

Therefore, we can apply Lemma [5] to conclude that with probability at least 1 — 6 over the 
privacy mechanism, 

j(fpH..,p)-j(f.p)< '"'';°„y . (51) 



From ISridharan et al.1 (|2008l ) , 



J(fpHv) - J(frtr) < 2(J(fpriv,I^) " J{r,V)) + O (^^^^T^) ^^^^ 

~ An^ej \ An J 

By definition of f*, we have J(frtr) — «^(fo) < 0. If A is set to be j]^^) then, the fourth quantity 
in Equation [39] is at most ^ . The theorem follows by solving for n to make the total excess error 
at most eg. □ 

The following lemma is analogous to Lemma 3, and it establishes a bound on the distance between the output of Algorithm 2 and non-private regularized ERM. We note that this bound holds when Algorithm 2 has $\epsilon_p' > 0$, that is, when $\Delta = 0$. Ensuring that $\Delta = 0$ requires an additional condition on $n$, which is stated in Theorem 4.

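Lemma 5. Let $\epsilon_p' > 0$. Let $\mathbf{f}^* = \operatorname{argmin} J(\mathbf{f}, \mathcal{D})$, and let $\mathbf{f}_{\mathrm{priv}}$ be the classifier output by Algorithm 2. If $N(\cdot)$ is 1-strongly convex and globally differentiable, and if $\ell$ is convex and differentiable at all points, with $|\ell'(z)| \le 1$ for all $z$, then

$$P_{\mathbf{b}}\left(J(\mathbf{f}_{\mathrm{priv}}, \mathcal{D}) \le J(\mathbf{f}^*, \mathcal{D}) + \frac{4d^2\log^2(d/\delta)}{\Lambda n^2 \epsilon_p'^2}\right) \ge 1 - \delta, \tag{54}$$

where the probability is taken over the randomness in the noise $\mathbf{b}$ of Algorithm 2.

Proof. By the assumption $\epsilon_p' > 0$, the classifier $\mathbf{f}_{\mathrm{priv}}$ minimizes the objective function $J(\mathbf{f}, \mathcal{D}) + \frac{1}{n}\mathbf{b}^T\mathbf{f}$, and therefore

$$J(\mathbf{f}_{\mathrm{priv}}, \mathcal{D}) \le J(\mathbf{f}^*, \mathcal{D}) + \frac{1}{n}\mathbf{b}^T(\mathbf{f}^* - \mathbf{f}_{\mathrm{priv}}). \tag{55}$$

First, we try to bound $\|\mathbf{f}^* - \mathbf{f}_{\mathrm{priv}}\|$. Recall that $\Lambda N(\cdot)$ is $\Lambda$-strongly convex and globally differentiable, and $\ell$ is convex and differentiable. We can therefore apply Lemma 1 with $G(\mathbf{f}) = J(\mathbf{f}, \mathcal{D})$ and $g(\mathbf{f}) = \frac{1}{n}\mathbf{b}^T\mathbf{f}$ to obtain the bound

$$\|\mathbf{f}^* - \mathbf{f}_{\mathrm{priv}}\| \le \frac{1}{\Lambda}\left\|\nabla\left(\frac{1}{n}\mathbf{b}^T\mathbf{f}\right)\right\| \le \frac{\|\mathbf{b}\|}{n\Lambda}. \tag{56}$$

Therefore by the Cauchy-Schwarz inequality,

$$J(\mathbf{f}_{\mathrm{priv}}, \mathcal{D}) - J(\mathbf{f}^*, \mathcal{D}) \le \frac{\|\mathbf{b}\|^2}{n^2\Lambda}. \tag{57}$$

Since $\|\mathbf{b}\|$ is drawn from a $\Gamma(d, \frac{2}{\epsilon_p'})$ distribution, from Lemma 4, with probability $1 - \delta$, $\|\mathbf{b}\| \le \frac{2d\log(d/\delta)}{\epsilon_p'}$. The Lemma follows by plugging this in to the previous equation. □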
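The noise vector used in this argument can be sampled by drawing its norm and direction separately. The sketch below assumes a density proportional to $e^{-\beta\|\mathbf{b}\|}$ with $\beta = \epsilon_p'/2$, as in the density ratio in the proof of Theorem 2; under that assumption the norm follows a $\Gamma(d, 1/\beta)$ distribution and the direction is uniform on the unit sphere. The function name and parameter values are hypothetical.

```python
import math
import random

def sample_noise(d, beta, rng):
    """Draw b in R^d with density proportional to exp(-beta * ||b||)."""
    # A normalized Gaussian vector gives a uniformly random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(g * g for g in direction))
    # The radius has density proportional to r^(d-1) * exp(-beta * r), i.e. Gamma(d, 1/beta).
    radius = rng.gammavariate(d, 1.0 / beta)
    return [radius * g / norm for g in direction]

rng = random.Random(7)
d, eps_prime, delta = 4, 0.5, 0.05
beta = eps_prime / 2

norms = []
for _ in range(5000):
    b = sample_noise(d, beta, rng)
    norms.append(math.sqrt(sum(v * v for v in b)))

mean_norm = sum(norms) / len(norms)
bound = 2 * d * math.log(d / delta) / eps_prime  # norm bound used in the proof of Lemma 5
fraction_within = sum(r <= bound for r in norms) / len(norms)
print(mean_norm, d / beta, fraction_within)
```

The empirical mean of the norm should be close to $d/\beta$, and the fraction of samples within the bound should be at least $1 - \delta$.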

4.3 Applications 

In this section, we examine the sample requirement of privacy-preserving regularized logistic re- 
gression and support vector machines. Recall that in both these cases, A'^(f) = ^||f|p. 

Corollary 5 (Logistic Regression). Let training data D be generated i.i.d. according to a distribu- 
tion P and let fo be a classifier with expected loss L{{q) = L* . Let the loss function i = ^lr defined 
in Section \3.4-l\ Then the following two statements hold: 

1. There exists a Ci such that if 

„>c,.»fWMi)lMaM,li2l«V (58) 

then the output fpriv of Algorithm{l\ satisfies 

F {L{iprw) < L* + Eg) > I - 5. (59) 

2. There exists a C2 such that if 

„>c.».(^£isM),^,lMMMy (00) 

then the output fp^v of AlgorithmlEwith c=\ satisfies 

P(L(fp,i,) <L* + e3) > 1-5. (61) 
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Proof. Since £lr is convex and doubly differentiable for any zi, Z2, 

4r(^i) - 4r(^2) < e'Uz*){zi - Z2) (62) 

for some z* G [2:1,2:2]. Moreover, |^lr(-^*)I ^ c = |, so is ^-Lipschitz. The corollary now follows 
from Theorems [3] and [H □ 

For SVMs we state results with I = ^HubcD but a similar bound can be shown for as well. 

Corollary 6 (Huber Support Vector Machines). Let training data T> be generated i.i.d. according 
to a distribution P and let fg be a classifier with expected loss -L(fo) = L* . Let the loss function 
^ = ^Hubcr defined in (|23p . Then the following two statements hold: 



1. There exists a Ci such that if 

^'|fo||2log(i) dlog(f)||fo|| dlog(f)||fo 



n > C, max , "JV ' ,v 3 ^ I ' ^^S) 

then the output fpriv of Algorithm{I\ satisfies 



r{L{ip,,,)<L* + eg)>l-5. (64) 

2. There exists a C2 such that if 

„>c„„fwM,iiw^Mimy (65) 

then the output fpriv of Algorithmic with c=\ satisfies 

P(L(fpriv) <^* + e9) > 1-5. (66) 



Proof. The Huber loss is convex and differentiable with continuous derivatives. Moreover, since the derivative of the Huber loss is piecewise linear with slope 0 or at most $\frac{1}{2h}$, for any $z_1$, $z_2$,

$$|\ell_{\mathrm{Huber}}'(z_1) - \ell_{\mathrm{Huber}}'(z_2)| \le \frac{1}{2h}|z_1 - z_2|, \tag{67}$$

so $\ell_{\mathrm{Huber}}'$ is $\frac{1}{2h}$-Lipschitz. The first part of the corollary follows from Theorem 3.

For the second part of the corollary, we observe that from Corollary 4 we do not need $\ell$ to be globally double differentiable, and the bound on $|\ell''(z)|$ in Theorem 4 is only needed to ensure that $\epsilon_p' > 0$; since $\ell_{\mathrm{Huber}}$ is double differentiable except in a set of Lebesgue measure 0, with $|\ell_{\mathrm{Huber}}''(z)| \le \frac{1}{2h}$, the corollary follows by an application of Theorem 4. □

5 Kernel methods



A powerful methodology in learning problems is the "kernel trick," which allows the efficient construction of a predictor $\mathbf{f}$ that lies in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated to a positive definite kernel function $k(\cdot, \cdot)$. The representer theorem (Kimeldorf and Wahba) shows that the regularized empirical risk in (1) is minimized by a function $\mathbf{f}(\mathbf{x})$ that is given by a linear combination of kernel functions centered at the data points:

$$\mathbf{f}(\mathbf{x}) = \sum_{i=1}^{n} a_i k(\mathbf{x}(i), \mathbf{x}). \tag{68}$$

This elegant result is important for both theoretical and computational reasons. Computationally, one releases the values $a_i$ corresponding to the $\mathbf{f}$ that minimizes the empirical risk, along with the data points $\mathbf{x}(i)$; the user classifies a new $\mathbf{x}$ by evaluating the function in (68).

A crucial difficulty in terms of privacy is that this directly releases the private values $\mathbf{x}(i)$ of some individuals in the training set. Thus, even if the classifier is computed in a privacy-preserving way, any classifier released by this process requires revealing the data. We provide an algorithm that avoids this problem, using an approximation method (Rahimi and Recht, 2007, 2008b) to approximate the kernel function using random projections.

5.1 Mathematical preliminaries 

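Our approach works for kernel functions which are translation invariant, so $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$. The key idea in the random projection method is from Bochner's Theorem, which states that a continuous translation invariant kernel is positive definite if and only if it is the Fourier transform of a nonnegative measure. This means that the Fourier transform $K(\theta)$ of a translation-invariant kernel function $k(t)$ can be normalized so that $\bar{K}(\theta) = K(\theta)/\|K(\theta)\|_1$ is a probability measure on the transform space $\Theta$. We will assume $\bar{K}(\theta)$ is uniformly bounded over $\theta$.

In this representation

$$k(\mathbf{x}, \mathbf{x}') = \int_{\Theta} \phi(\mathbf{x}; \theta)\phi(\mathbf{x}'; \theta)\bar{K}(\theta)\,d\theta, \tag{69}$$

where we will assume the feature functions $\phi(\mathbf{x}; \theta)$ are bounded:

$$|\phi(\mathbf{x}; \theta)| \le \zeta \quad \forall \mathbf{x} \in \mathcal{X},\ \forall \theta \in \Theta. \tag{70}$$

A function $\mathbf{f} \in \mathcal{H}$ can be written as

$$\mathbf{f}(\mathbf{x}) = \int_{\Theta} a(\theta)\phi(\mathbf{x}; \theta)\bar{K}(\theta)\,d\theta. \tag{71}$$

To prove our generalization bounds we must show that bounded classifiers $\mathbf{f}$ induce bounded functions $a(\theta)$. Writing the evaluation functional as an inner product with $k(\mathbf{x}, \mathbf{x}')$ and (69) shows

$$\mathbf{f}(\mathbf{x}) = \int_{\Theta}\left(\int_{\mathcal{X}} \mathbf{f}(\mathbf{x}')\phi(\mathbf{x}'; \theta)\,d\mathbf{x}'\right)\phi(\mathbf{x}; \theta)\bar{K}(\theta)\,d\theta. \tag{72}$$

Thus we have

$$a(\theta) = \int_{\mathcal{X}} \mathbf{f}(\mathbf{x}')\phi(\mathbf{x}'; \theta)\,d\mathbf{x}' \tag{73}$$

$$|a(\theta)| \le \mathrm{Vol}(\mathcal{X}) \cdot \zeta \cdot \|\mathbf{f}\|_{\infty}. \tag{74}$$

This shows that $a(\theta)$ is bounded uniformly over $\Theta$ when $\mathbf{f}(\mathbf{x})$ is bounded uniformly over $\mathcal{X}$. The volume of the unit ball is $\mathrm{Vol}(\mathcal{X}) = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2} + 1)}$ (see Ball (1997) for more details). For large $d$ this is $\left(\sqrt{\frac{2\pi e}{d}}\right)^d$ by Stirling's formula. Furthermore, we have

$$\|\mathbf{f}\|_{\mathcal{H}}^2 = \int_{\Theta} a(\theta)^2 \bar{K}(\theta)\,d\theta. \tag{75}$$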
5.2 A reduction to the linear case 

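We now describe how to apply Algorithms 1 and 2 for classification with kernels, by transforming to linear classification. Given $\{\theta_j\}$, let $R : \mathcal{X} \to \mathbb{R}^D$ be the map that sends $\mathbf{x}(i)$ to a vector $\mathbf{v}(i) \in \mathbb{R}^D$ where $v_j(i) = \phi(\mathbf{x}(i); \theta_j)$ for $j \in [D]$. We then use Algorithm 1 or Algorithm 2 to compute a privacy-preserving linear classifier $\tilde{\mathbf{f}}$ in $\mathbb{R}^D$. The algorithm releases $R$ and $\tilde{\mathbf{f}}$. The overall classifier is $\mathbf{f}_{\mathrm{priv}}(\mathbf{x}) = \tilde{\mathbf{f}}(R(\mathbf{x}))$.

Algorithm 3 Private ERM for nonlinear kernels

Inputs: Data $\{(\mathbf{x}_i, y_i) : i \in [n]\}$, positive definite kernel function $k(\cdot, \cdot)$, sampling function $\bar{K}(\theta)$, parameters $\epsilon_p$, $\Lambda$, $D$

Outputs: Predictor $\mathbf{f}_{\mathrm{priv}}$ and pre-filter $\{\theta_j : j \in [D]\}$.

Draw $\{\theta_j : j = 1, 2, \ldots, D\}$ iid according to $\bar{K}(\theta)$.

Set $\mathbf{v}(i) = \sqrt{2/D}\,[\phi(\mathbf{x}(i); \theta_1) \cdots \phi(\mathbf{x}(i); \theta_D)]^T$ for each $i$.

Run Algorithm 1 or Algorithm 2 with data $\{(\mathbf{v}(i), y(i))\}$ and parameters $\epsilon_p$, $\Lambda$.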

As an example, consider the Gaussian kernel

$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma\|\mathbf{x} - \mathbf{x}'\|_2^2\right). \tag{76}$$

The Fourier transform of a Gaussian is a Gaussian, so we can sample $\theta_j = (\omega, \psi)$ with $\omega \sim \mathcal{N}(0, 2\gamma I_d)$ and $\psi \sim \mathrm{Uniform}[-\pi, \pi]$, and compute $v_j = \cos(\omega^T\mathbf{x} + \psi)$. The random phase is used to produce a real-valued mapping. The paper of Rahimi and Recht (2008a) has more examples of transforms for other kernel functions.
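The random feature construction for the Gaussian kernel can be sketched in a few lines. This is an illustrative example, not the authors' implementation: the function names, the test points, and the values of $\gamma$ and $D$ are hypothetical. It draws the pre-filter $\{(\omega_j, \psi_j)\}$, applies the $\sqrt{2/D}$ scaling from Algorithm 3, and compares the inner product of two feature vectors with the exact kernel value.

```python
import math
import random

def draw_prefilter(d, num_features, gamma, rng):
    """Draw (omega, psi) pairs with omega ~ N(0, 2*gamma*I_d) and psi ~ Uniform[-pi, pi]."""
    std = math.sqrt(2.0 * gamma)
    return [
        ([rng.gauss(0.0, std) for _ in range(d)], rng.uniform(-math.pi, math.pi))
        for _ in range(num_features)
    ]

def feature_map(x, prefilter):
    """Map x to sqrt(2/D) * [cos(omega_j^T x + psi_j)] for j = 1..D."""
    scale = math.sqrt(2.0 / len(prefilter))
    return [
        scale * math.cos(sum(w * xi for w, xi in zip(omega, x)) + psi)
        for omega, psi in prefilter
    ]

def gaussian_kernel(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(0)
gamma, num_features = 0.5, 4000
prefilter = draw_prefilter(3, num_features, gamma, rng)

x = [0.2, -0.4, 0.1]
y = [-0.3, 0.5, 0.0]
approx = sum(a * b for a, b in zip(feature_map(x, prefilter), feature_map(y, prefilter)))
exact = gaussian_kernel(x, y, gamma)
print(approx, exact)
```

Since the pre-filter is drawn without looking at the data, only the linear classifier trained on the mapped points depends on the private dataset; the approximation error of the inner product shrinks as the number of features $D$ grows.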



5.3 Privacy guarantees 

Because the workhorse of Algorithm 3 is a differentially private version of ERM for linear classifiers (either Algorithm 1 or Algorithm 2), and the points $\{\theta_j : j \in [D]\}$ are independent of the data, the privacy guarantees for Algorithm 3 follow trivially from Theorems 1 and 2.

Theorem 5. Given data $\{(\mathbf{x}(i), y(i)) : i = 1, 2, \ldots, n\}$ with $\|\mathbf{x}(i)\| \le 1$, the outputs $(\mathbf{f}_{\mathrm{priv}}, \{\theta_j : j \in [D]\})$ of Algorithm 3 guarantee $\epsilon_p$-differential privacy.

The proof trivially follows from a combination of Theorems 1 and 2, and the fact that the $\theta_j$'s are drawn independently of the input dataset.

5.4 Generalization performance

We now turn to generalization bounds for Algorithm 3. We will prove results using objective perturbation (Algorithm 2) in Algorithm 3, but analogous results for output perturbation (Algorithm 1) are simple to prove. Our comparisons will be against arbitrary predictors $\mathbf{f}_0$ whose norm is bounded in some sense. That is, given an $\mathbf{f}_0$ with some properties, we will choose regularization parameter $\Lambda$, dimension $D$, and number of samples $n$ so that the predictor $\mathbf{f}_{\mathrm{priv}}$ has expected loss close to that of $\mathbf{f}_0$.

In this section we will assume $N(\mathbf{f}) = \frac{1}{2}\|\mathbf{f}\|_2^2$ so that $N(\cdot)$ is 1-strongly convex, and that the loss function $\ell$ is convex, differentiable and $|\ell'(z)| \le 1$ for all $z$.

Our first generalization result is the simplest, since it assumes a strong condition that gives easy guarantees on the projections. We would like the predictor produced by Algorithm 3 to be competitive against an $\mathbf{f}_0$ such that

$$\mathbf{f}_0(\mathbf{x}) = \int_{\Theta} a_0(\theta)\phi(\mathbf{x}; \theta)\bar{K}(\theta)\,d\theta, \tag{77}$$

and $|a_0(\theta)| \le C$ (see Rahimi and Recht (2008b)). Our first result provides the technical building block for our other generalization results. The proof makes use of ideas from Rahimi and Recht (2008b) and techniques from Sridharan et al. (2008) and Shalev-Shwartz and Srebro (2008).



Lemma 6. Let fg be a predictor such that \aQ{0)\ < C, for all 6, where ao{9) is given by ( (771 ), 
and suppose L(fo) = L* . Moreover, suppose that £'{•) is c-Lipschitz. If the data V is drawn i.i.d. 
according to P , then there exists a constant Cq such that if 



n > Co ■ max ^ • log ■ 



e f2 



egS 'eplog(l/(5) 



(78) 



then A and D can be chosen such that the output fp^iv of Algorithmic using Algorithmic satisfies 



P(L(f, 



pnv ; 



L* <eg)>l- 4(5. 



(79) 



Proof. Since |ao(^)| < C and the K{6) is bounded, we have (jRahimi and Rechtl . l2008bl . Theorem 
1) that with probability 1 — 25 there exists an fp G such that 



L(fp) < L(fo) + O 




(80) 



We will choose D to make this loss small. Furthermore, fp is guaranteed to have ||fp||„ < C/D, so 



p2 

If < — 



(81) 



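Now given such an $\mathbf{f}_p$ we must show that $\mathbf{f}_{\mathrm{priv}}$ will have true risk close to that of $\mathbf{f}_p$ as long as there are enough data points. This can be shown using the techniques in Shalev-Shwartz and Srebro (2008). Let

$$J(\mathbf{f}) = L(\mathbf{f}) + \frac{\Lambda}{2}\|\mathbf{f}\|_2^2,$$

and let

$$\mathbf{f}_{\mathrm{rtr}} = \operatorname*{argmin}_{\mathbf{f}} J(\mathbf{f})$$

minimize the regularized true risk. Then

$$J(\mathbf{f}_{\mathrm{priv}}) = J(\mathbf{f}_p) + \left(J(\mathbf{f}_{\mathrm{priv}}) - J(\mathbf{f}_{\mathrm{rtr}})\right) + \left(J(\mathbf{f}_{\mathrm{rtr}}) - J(\mathbf{f}_p)\right).$$

Now, since $J(\cdot)$ is minimized by $\mathbf{f}_{\mathrm{rtr}}$, the last term is nonpositive and we can disregard it. Then we have

$$L(\mathbf{f}_{\mathrm{priv}}) - L(\mathbf{f}_p) \le \left(J(\mathbf{f}_{\mathrm{priv}}) - J(\mathbf{f}_{\mathrm{rtr}})\right) + \frac{\Lambda}{2}\|\mathbf{f}_p\|_2^2 - \frac{\Lambda}{2}\|\mathbf{f}_{\mathrm{priv}}\|_2^2. \tag{82}$$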

From Lemma O with probability at least 1 — 6 over the noise b, 

4Z)2 log^{D/6) 



^(fpriv) - J ^argmin J (f)^ < ^^^2^2 



(83) 



Now using ()Sridharan et al.l . 120081 . Corollary 2), we can bound the term (J(fpriv) — -^(frtr)) by twice 
the gap in the regularized empirical risk difference ()83p plus an additional term. That is, with 
probability 1 — 5: 

J(fpriv) - J(frtr) < 2( J(fpHv) " J(frtr)) + O (^^4^) " ^^^^ 

If we set n > then e'p > 0, and we can plug Lemma [5] into (f84|) to obtain: 

^^ 8DHog\D/6) , ^^ log(l/J) ^ 

J(fpnv) - J(frtr) < ^,2,2 + O ) ■ (85) 

Plugging ([85]) into ([82]) . discarding the negative term involving ||fpriv|l2 setting A = e^/ ||fp|p 
gives 

. ^ 8||fp||^D2log2(Z)M) /||fJ|2logi\ e„ , , 

L(fpHv) - L(f,) < "^"^ + O "''^^ ' + f . (86) 

Now we have, using (j80p and (j86p . that with probability 1 — 45: 
L(fpriv) - L(fo) < (L(fpriv) - L{ip)) + (L(fp) - L(fo)) 

^ 8||fp||^D2log^(D/^) ^ ^/ ||f,||^log(l/^) \ ^ eg 



Substituting (|8T]) . we have 
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To set the remaining parameters, we will choose D < n so that 



n^e^Eg \ Dneg J 2 \ ^Jb 



We set D = 0(0"^ log(l/5)/e^) to make the last term eg/6, and: 



log J log „ 2e 



L(fp„.) - L(fo) < O I ,.,.,3 - ) + O ) + 

Setting n as in ([78]) proves the result. Moreover, setting n > = Co • ^ logfi/i?) ^^^^^^^^ 

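We can adapt the proof procedure to show that Algorithm 3 is competitive against any classifier
$f_0$ with a given bound on $\|f_0\|_\infty$. It can be shown that for some constant $\zeta$,
$|a_0(\theta)| \le \mathrm{Vol}(\mathcal{X}) \, \zeta \, \|f_0\|_\infty$. Then we can set this as $C$ in (78) to obtain the following result.

Theorem 6. Let $f_0$ be a classifier with norm $\|f_0\|_\infty$, and let $\ell'(\cdot)$ be $c$-Lipschitz. Then for any
distribution $P$, there exists a constant $C_0$ such that if

\[ n > C_0 \cdot \max\left( \frac{\|f_0\|_\infty^2 \zeta^2 (\mathrm{Vol}(\mathcal{X}))^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^2} \cdot \log \frac{\|f_0\|_\infty \mathrm{Vol}(\mathcal{X}) \zeta \log(1/\delta)}{\epsilon_g \delta \Gamma(\frac{d}{2} + 1)}, \; \frac{c \epsilon_g}{\epsilon_p \log(1/\delta)} \right), \tag{87} \]

then $\Lambda$ and $D$ can be chosen such that the output $f_{priv}$ of Algorithm 3 run with Algorithm 2 satisfies

\[ P\left( L(f_{priv}) - L(f_0) \le \epsilon_g \right) \ge 1 - 4\delta. \]

Proof. Substituting $C = \mathrm{Vol}(\mathcal{X}) \, \zeta \, \|f_0\|_\infty$ in Lemma 6 we get the result. □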
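We can also derive a generalization result with respect to classifiers with bounded $\|f_0\|_{\mathcal{H}}$.

Theorem 7. Let $f_0$ be a classifier with norm $\|f_0\|_{\mathcal{H}}$, and let $\ell'$ be $c$-Lipschitz. Then for any
distribution $P$, there exists a constant $C_0$ such that if

\[ n = C_0 \cdot \max\left( \frac{\|f_0\|_{\mathcal{H}}^4 \zeta^2 (\mathrm{Vol}(\mathcal{X}))^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^4} \cdot \log \frac{\|f_0\|_{\mathcal{H}} \mathrm{Vol}(\mathcal{X}) \zeta \log(1/\delta)}{\epsilon_g \delta \Gamma(\frac{d}{2} + 1)}, \; \frac{c \epsilon_g}{\epsilon_p \log(1/\delta)} \right), \tag{88} \]

then $\Lambda$ and $D$ can be chosen such that the output of Algorithm 3 run with Algorithm 2 satisfies

\[ P\left( L(f_{priv}) - L(f_0) \le \epsilon_g \right) \ge 1 - 4\delta. \]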

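Proof. Let $f_0$ be a classifier with norm $\|f_0\|_{\mathcal{H}}$ and expected loss $L(f_0)$. Now consider

\[ f_{rtr} = \operatorname*{argmin}_{f} \; L(f) + \frac{\Lambda_{rtr}}{2} \|f\|_{\mathcal{H}}^2, \]

for some $\Lambda_{rtr}$ to be specified later. We will first need a bound on $\|f_{rtr}\|_\infty$ in order to use our previous
sample complexity results. Since $f_{rtr}$ is a minimizer, we can take the derivative of the regularized
expected loss and set it to 0 to get:

\[ f_{rtr}(x') = -\frac{1}{\Lambda_{rtr}} \int \frac{\partial \ell(f(x), y)}{\partial f(x)} \cdot \frac{\partial f(x)}{\partial f(x')} \, dP(x, y), \]

where $P(x, y)$ is a distribution on pairs $(x, y)$. Now, using the representer theorem,
$\frac{\partial}{\partial f(x')} f(x) = k(x', x)$. Since the kernel function is bounded and the derivative of the loss is always
upper bounded by 1, the integrand can be upper bounded by a constant. Since $P(x, y)$ is a probability
distribution, we have for all $x'$ that $|f_{rtr}(x')| = O(1/\Lambda_{rtr})$. Now we set $\Lambda_{rtr} = \epsilon_g / \|f_0\|_{\mathcal{H}}^2$ to get

\[ \|f_{rtr}\|_\infty = O\left( \frac{\|f_0\|_{\mathcal{H}}^2}{\epsilon_g} \right). \]

We now have two cases to consider, depending on whether $L(f_0) < L(f_{rtr})$ or $L(f_0) \ge L(f_{rtr})$.

Case 1: Suppose that $L(f_0) < L(f_{rtr})$. Then by the definition of $f_{rtr}$,

\[ L(f_{rtr}) + \frac{\Lambda_{rtr}}{2} \|f_{rtr}\|_{\mathcal{H}}^2 \le L(f_0) + \frac{\Lambda_{rtr}}{2} \|f_0\|_{\mathcal{H}}^2 = L(f_0) + \frac{\epsilon_g}{2}. \]

Since $\frac{\Lambda_{rtr}}{2} \|f_{rtr}\|_{\mathcal{H}}^2 \ge 0$, we have $L(f_{rtr}) - L(f_0) \le \frac{\epsilon_g}{2}$.

Case 2: Suppose that $L(f_0) \ge L(f_{rtr})$. Then the regularized classifier has better generalization
performance than the original, so we have trivially that $L(f_{rtr}) - L(f_0) \le \frac{\epsilon_g}{2}$.

Therefore in both cases we have a bound on $\|f_{rtr}\|_\infty$ and a generalization gap of $\epsilon_g/2$. We can
now apply Theorem 6 to show that for $n$ satisfying (87) we have

\[ P\left( L(f_{priv}) - L(f_0) \le \epsilon_g \right) \ge 1 - 4\delta. \]

□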



6 Parameter tuning 

The privacy-preserving learning algorithms presented so far in this paper assume that the
regularization constant $\Lambda$ is provided as an input, and is independent of the data. In actual
applications of ERM, $\Lambda$ is selected based on the data itself. In this section, we address this issue:
how to design an ERM algorithm with end-to-end privacy, which selects $\Lambda$ based on the data itself.

Our solution is to present a privacy-preserving parameter tuning technique that is applicable
in general machine learning algorithms, beyond ERM. In practice, one typically tunes parameters
(such as the regularization parameter $\Lambda$) as follows: using data held out for validation, train
predictors $f(\cdot; \Lambda)$ for multiple values of $\Lambda$, and select the one which provides the best empirical
performance. However, even though the output of an algorithm preserves $\epsilon_p$-differential privacy for
a fixed $\Lambda$ (as is the case with Algorithms 1 and 2), choosing a $\Lambda$ based on empirical performance
on a validation set may violate $\epsilon_p$-differential privacy guarantees. That is, if the procedure that
picks $\Lambda$ is not private, then an adversary may use the released classifier to infer the value of $\Lambda$ and
therefore something about the values in the database.

We suggest two ways of resolving this issue. First, if we have access to a smaller publicly
available dataset from the same distribution, then we can use this as a holdout set to tune $\Lambda$. This
$\Lambda$ can be subsequently used to train a classifier on the private data. Since the value of $\Lambda$ does
not depend on the values in the private data set, this procedure will still preserve the privacy of
individuals in the private data.

If no such public data is available, then we need a differentially private tuning procedure. We
provide such a procedure below. The main idea is to train for different values of $\Lambda$ on separate
subsets of the training dataset, so that the total training procedure still maintains $\epsilon_p$-differential
privacy. We score each of these predictors on a validation set, and choose a $\Lambda$ (and hence $f(\cdot; \Lambda)$)
using a randomized privacy-preserving comparison procedure (McSherry and Talwar, 2007). The
last step is needed to guarantee $\epsilon_p$-differential privacy for individuals in the validation set. This
final algorithm provides an end-to-end guarantee of differential privacy, and renders our
privacy-preserving ERM procedure complete. We observe that both these procedures can be used for tuning
multiple parameters as well.



6.1 Tuning algorithm 



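Algorithm 4 Privacy-preserving parameter tuning

Inputs: Database $\mathcal{D}$, parameters $\{\Lambda_1, \ldots, \Lambda_m\}$, $\epsilon_p$.
Outputs: Parameter $f_{priv}$.

Divide $\mathcal{D}$ into $m + 1$ equal portions $\mathcal{D}_1, \ldots, \mathcal{D}_{m+1}$, each of size $\frac{|\mathcal{D}|}{m+1}$.

For each $i = 1, 2, \ldots, m$, apply a privacy-preserving learning algorithm (e.g. Algorithms 1, 2, or
3) on $\mathcal{D}_i$ with parameter $\Lambda_i$ and $\epsilon_p$ to get output $f_i$.

Evaluate $z_i$, the number of mistakes made by $f_i$ on $\mathcal{D}_{m+1}$. Set $f_{priv} = f_i$ with probability

\[ q_i = \frac{e^{-\epsilon_p z_i / 2}}{\sum_{j=1}^{m} e^{-\epsilon_p z_j / 2}}. \tag{89} \]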



We note that the list of potential $\Lambda$ values input to this procedure should not be a function of
the private dataset. It can be shown that the empirical error on $\mathcal{D}_{m+1}$ of the classifier output by
this procedure is close to the empirical error of the best classifier in the set $\{f_1, \ldots, f_m\}$ on $\mathcal{D}_{m+1}$,
provided $|\mathcal{D}|$ is large enough.
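
The steps of Algorithm 4 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the names `private_tune` and `train_private` are hypothetical, the random shuffle before splitting is an added assumption, and the privacy guarantee holds only if `train_private` is itself $\epsilon_p$-differentially private and the candidate list `lambdas` was chosen without looking at the private data.

```python
import numpy as np

def private_tune(X, y, lambdas, eps_p, train_private, rng=None):
    """Sketch of Algorithm 4: choose among m privately trained classifiers.

    train_private(X_i, y_i, lam, eps_p, rng) is a hypothetical callback that
    must itself be eps_p-differentially private and return a weight vector.
    Labels y are assumed to take values in {-1, +1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(lambdas)
    # m + 1 disjoint portions: m for training, the last one for validation.
    parts = np.array_split(rng.permutation(len(y)), m + 1)
    X_val, y_val = X[parts[m]], y[parts[m]]

    classifiers, mistakes = [], []
    for lam, idx in zip(lambdas, parts[:m]):
        w = train_private(X[idx], y[idx], lam, eps_p, rng)
        classifiers.append(w)
        # z_i: number of validation examples whose predicted sign is wrong.
        mistakes.append(int(np.sum(np.sign(X_val @ w) != y_val)))

    # Exponential mechanism, equation (89): q_i proportional to exp(-eps_p * z_i / 2).
    z = np.asarray(mistakes, dtype=float)
    logits = -eps_p * (z - z.min()) / 2.0  # shifting by the minimum avoids underflow
    q = np.exp(logits) / np.exp(logits).sum()
    i = int(rng.choice(m, p=q))
    return classifiers[i], lambdas[i], q
```

Subtracting `z.min()` before exponentiating leaves the probabilities in (89) unchanged, because the common factor cancels in the ratio.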

6.2 Privacy and utility 

Theorem 8. The output of the tuning procedure of Algorithm 4 is $\epsilon_p$-differentially private.

Proof. To show that Algorithm 4 preserves $\epsilon_p$-differential privacy, we first consider an alternative
procedure $\mathcal{M}$. Let $\mathcal{M}$ be the procedure that releases the values $(f_1, \ldots, f_m, i)$ where $f_1, \ldots, f_m$ are
the intermediate values computed in the second step of Algorithm 4, and $i$ is the index selected by
the exponential mechanism step. We first show that $\mathcal{M}$ preserves $\epsilon_p$-differential privacy.

Let $\mathcal{D}$ and $\mathcal{D}'$ be two datasets that differ in the value of one individual such that
$\mathcal{D} = \bar{\mathcal{D}} \cup \{(x, y)\}$ and $\mathcal{D}' = \bar{\mathcal{D}} \cup \{(x', y')\}$.

Recall that the datasets $\mathcal{D}_1, \ldots, \mathcal{D}_{m+1}$ are disjoint; moreover, the randomness in the privacy
mechanisms is independent. Therefore,

\[ \begin{aligned}
P(f_1 \in S_1, \ldots, f_m \in S_m, i = i^* \mid \mathcal{D})
  &= \int_{S_1 \times \cdots \times S_m} P(i = i^* \mid f_1, \ldots, f_m, \mathcal{D}_{m+1}) \, \mu(f_1, \ldots, f_m \mid \mathcal{D}) \, df_1 \cdots df_m \\
  &= \int_{S_1 \times \cdots \times S_m} P(i = i^* \mid f_1, \ldots, f_m, \mathcal{D}_{m+1}) \prod_{j=1}^{m} \mu_j(f_j \mid \mathcal{D}_j) \, df_1 \cdots df_m,
\end{aligned} \tag{90} \]

where $\mu_j(f)$ is the density at $f$ induced by the classifier run with parameter $\Lambda_j$, and $\mu(f_1, \ldots, f_m)$ is
the joint density at $f_1, \ldots, f_m$ induced by $\mathcal{M}$. Now suppose that $(x, y) \in \mathcal{D}_j$, for $j = m + 1$. Then
$\mathcal{D}_k = \mathcal{D}'_k$, and $\mu_k(f_k \mid \mathcal{D}_k) = \mu_k(f_k \mid \mathcal{D}'_k)$, for $k \in [m]$. Moreover, given any fixed set $f_1, \ldots, f_m$,

\[ P(i = i^* \mid \mathcal{D}'_{m+1}, f_1, \ldots, f_m) \le e^{\epsilon_p} P(i = i^* \mid \mathcal{D}_{m+1}, f_1, \ldots, f_m). \tag{91} \]

Instead, if $(x, y) \in \mathcal{D}_j$, for $j \in [m]$, then $\mathcal{D}_k = \mathcal{D}'_k$ for $k \in [m + 1]$, $k \ne j$. Thus, for a fixed
$f_1, \ldots, f_m$,

\[ P(i = i^* \mid \mathcal{D}'_{m+1}, f_1, \ldots, f_m) = P(i = i^* \mid \mathcal{D}_{m+1}, f_1, \ldots, f_m), \tag{92} \]

\[ \mu_k(f_k \mid \mathcal{D}_k) \le e^{\epsilon_p} \mu_k(f_k \mid \mathcal{D}'_k). \tag{93} \]

The lemma follows by combining (90)-(93).

Now, an adversary who has access to the output of $\mathcal{M}$ can compute the output of Algorithm
4 itself, without any further access to the dataset. Therefore, by a simulatability argument, as in
Dwork et al. (2006b), Algorithm 4 also preserves $\epsilon_p$-differential privacy. □



In the theorem above, we assume that the individual algorithms for privacy-preserving
classification satisfy Definition 2; a similar theorem can also be shown when they satisfy a guarantee as
in Corollary m

The following theorem shows that the empirical error on $\mathcal{D}_{m+1}$ of the classifier output by the
tuning procedure is close to the empirical error of the best classifier in the set $\{f_1, \ldots, f_m\}$. The
proof of this theorem follows from Lemma 7 of McSherry and Talwar (2007).

Theorem 9. Let $z_{\min} = \min_i z_i$, and let $z$ be the number of mistakes made on $\mathcal{D}_{m+1}$ by the classifier
output by our tuning procedure. Then, with probability $1 - \delta$,

\[ z \le z_{\min} + \frac{2 \log(m/\delta)}{\epsilon_p}. \tag{94} \]



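Proof. In the notation of McSherry and Talwar (2007), $z_{\min} = OPT$, the base measure $\mu$ is
uniform on $[m]$, and $S_t = \{i : z_i < z_{\min} + t\}$. Their Lemma 7 shows that

\[ P(\bar{S}_{2t}) \le \frac{\exp(-\epsilon_p t)}{\mu(S_t)}, \tag{95} \]

where $\mu$ is the uniform measure on $[m]$. Using $\min \mu(S_t) = \frac{1}{m}$ to upper bound the right side and
setting it equal to $\delta$ we obtain

\[ t = \frac{1}{\epsilon_p} \log \frac{m}{\delta}. \tag{96} \]

From this we have

\[ P\left( z \ge z_{\min} + \frac{2}{\epsilon_p} \log \frac{m}{\delta} \right) \le \delta, \tag{97} \]

and the result follows. □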
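
To get a feel for the size of the additive term in Theorem 9, the following sketch evaluates $2 \log(m/\delta)/\epsilon_p$ for one setting. The function name `tuning_slack` and the values $m = 5$, $\delta = 0.05$, $\epsilon_p = 0.1$ are illustrative assumptions, not settings taken from the theorem.

```python
import math

def tuning_slack(m, delta, eps_p):
    """Additive slack 2*log(m/delta)/eps_p from (94), in number of mistakes."""
    return 2.0 * math.log(m / delta) / eps_p

# Illustrative values: 5 candidate parameters, delta = 0.05, eps_p = 0.1.
slack = tuning_slack(5, 0.05, 0.1)
print(round(slack, 1))  # 92.1
```

On a validation set of 25,000 examples, about 92 extra mistakes correspond to roughly 0.4 percentage points of error, which illustrates why the validation portion must be large relative to $1/\epsilon_p$.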
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7 Experiments 



In this section we give experimental results for training linear classifiers with Algorithms 1 and 2
on two real datasets. Imposing privacy requirements necessarily degrades classifier performance.
Our experiments show that provided there is sufficient data, objective perturbation (Algorithm
2) typically outperforms the sensitivity method (Algorithm 1) significantly, and achieves an error rate
close to that of the analogous non-private ERM algorithm. We first demonstrate how the accuracy of the
classification algorithms varies with $\epsilon_p$, the privacy requirement. We then show how the performance
of privacy-preserving classification varies with increasing training data size.

The first dataset we consider is the Adult dataset from the UCI Machine Learning Repository
(Asuncion and Newman, 2007). This moderately-sized dataset contains demographic information
about approximately 47,000 individuals, and the classification task is to predict whether the annual
income of an individual is below or above $50,000, based on variables such as age, sex, occupation,
and education. For our experiments, the average fraction of positive labels is about 0.25; therefore,
a trivial classifier that always predicts −1 will achieve this error-rate, and only error-rates below
0.25 are interesting.

The second dataset we consider is the KDDCup99 dataset (Hettich and Bay, 1999); the task here
is to predict whether a network connection is a denial-of-service attack or not, based on several
attributes. The dataset includes about 5,000,000 instances. For this data the average fraction of
positive labels is 0.20.

In order to implement the convex minimization procedure, we use the convex optimization
library provided by Okazaki (2009).



7.1 Preprocessing 

In order to process the Adult dataset into a form amenable to classification, we removed all entries
with missing values, and converted each categorical attribute to a binary vector. For example, an
attribute such as (Male, Female) was converted into 2 binary features. Each column was normalized
to ensure that the maximum value is 1, and then each row was normalized to ensure that the norm of
any example is at most 1. After preprocessing, each example was represented by a 105-dimensional
vector, of norm at most 1.

For the KDDCup99 dataset, the instances were preprocessed by converting each categorical
attribute to a binary vector. Each column was normalized to ensure that the maximum value is 1,
and finally, each row was normalized, to ensure that the norm of any example is at most 1. After
preprocessing, each example was represented by a 119-dimensional vector, of norm at most 1.
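
The two normalization steps can be sketched as follows. This is an illustration under stated assumptions: the function name `preprocess` is hypothetical, the input is taken to be already numeric (categorical attributes expanded to binary features), and rows whose norm is already at most 1 are left unchanged, which is one of several ways to satisfy the norm constraint described above.

```python
import numpy as np

def preprocess(X):
    """Scale each column to maximum absolute value 1, then bound row norms by 1."""
    X = np.asarray(X, dtype=float)
    col_max = np.abs(X).max(axis=0)
    col_max[col_max == 0] = 1.0  # leave all-zero columns untouched
    X = X / col_max
    row_norm = np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: only rows with norm above 1 are shrunk onto the unit sphere.
    return X / np.maximum(row_norm, 1.0)
```

Bounding every example by norm 1 is what the sensitivity arguments for Algorithms 1 and 2 rely on, so this step is part of the privacy analysis rather than a cosmetic rescaling.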



7.2 Privacy- Accuracy Tradeoff 

For our first set of experiments, we study the tradeoff between the privacy requirement on the
classifier, and its classification accuracy, when the classifier is trained on data of a fixed size. The
privacy requirement is quantified by the value of $\epsilon_p$; increasing $\epsilon_p$ implies a higher change in the
belief of the adversary when one entry in $\mathcal{D}$ changes, and thus lower privacy. To measure accuracy,
we use classification (test) error; namely, the fraction of times the classifier predicts a label with
the wrong sign.

To study the privacy-accuracy tradeoff, we compare objective perturbation with the sensitivity
method for logistic regression and Huber SVM. For Huber SVM, we picked the Huber constant
$h = 0.5$, a typical value (Chapelle, 2007).^1 For each dataset we trained classifiers for a few fixed
values of $\Lambda$ and tested the error of these classifiers. For each algorithm we chose the value of $\Lambda$ that
minimizes the error-rate for $\epsilon_p = 0.1$.^2 We then plotted the error-rate against $\epsilon_p$ for the chosen
value of $\Lambda$. The results are shown in Figures 2 and 3 for both logistic regression and support vector
machines.^3 The optimal values of $\Lambda$ are shown in Tables 1 and 2. For non-private logistic regression
and SVM, each presented error-rate is an average over 10-fold cross-validation; for the sensitivity
method as well as objective perturbation, the presented error-rate is an average over 10-fold
cross-validation and 50 runs of the randomized training procedure. For Adult, the privacy-accuracy
tradeoff is computed over the entire dataset, which consists of 45,220 examples; for KDDCup99 we
use a randomly chosen subset of 70,000 examples.

^1 Chapelle (2007) recommends using $h$ between 0.01 and 0.5; we use $h = 0.5$ as we found that a higher value
typically leads to more numerical stability, as well as better performance for both privacy-preserving methods.
^2 For KDDCup99 the error of the non-private algorithms did not increase with decreasing $\Lambda$.
^3 The slight kink in the SVM curve on Adult is due to a switch to the second phase of the algorithm.

Figure 2: Privacy-accuracy tradeoff for the Adult dataset

Figure 3: Privacy-accuracy tradeoff for the KDDCup99 dataset

$\Lambda$                                                   10^{-3.0}  10^{-2.5}  10^{-2.0}  10^{-1.5}
Logistic
  Non-Private   0.1540     0.1533     0.1654     0.1694     0.1758     0.1895     0.2322     0.2478
  Output        0.5318     0.5318     0.5175     0.4928     0.4310     0.3163     0.2395     0.2456
  Objective     0.8248     0.8248     0.8248     0.2694     0.2369     0.2161     0.2305     0.2475
Huber
  Non-Private   0.1527     0.1521     0.1632     0.1669     0.1719     0.1793     0.2454     0.2478
  Output        0.5318     0.5318     0.5211     0.5011     0.4464     0.3352     0.2376     0.2476
  Objective     0.2585     0.2585     0.2585     0.2582     0.2559     0.2046     0.2319     0.2478

Table 1: Error for different regularization parameters on Adult for $\epsilon_p = 0.1$. The best error per
algorithm is in bold.

Logistic
  Non-Private   0.0016     0.0016                0.0038     0.0037     0.0037     0.0325     0.0594
  Output        0.5245     0.5245     0.5093     0.3518     0.1114     0.0359     0.0304     0.0678
  Objective     0.2084     0.2084     0.2084     0.0196     0.0118     0.0113     0.0285     0.0591
Huber
  Non-Private   0.0013     0.0013     0.0013     0.0029     0.0051     0.0056     0.0061     0.0163
  Output        0.5245     0.5245     0.5229     0.4611     0.3353     0.0590     0.0092     0.0179
  Objective     0.0191     0.0191     0.0191     0.1827     0.0123     0.0066     0.0064     0.0157

Table 2: Error for different regularization parameters on KDDCup99 for $\epsilon_p = 0.1$. The best error per
algorithm is in bold.

For the Adult dataset, the constant classifier that classifies all examples to be negative achieves
a classification error of about 0.25. The sensitivity method thus does slightly better than this
constant classifier for most values of $\epsilon_p$ for both logistic regression and support vector machines.
Objective perturbation outperforms sensitivity, and objective perturbation for support vector machines
achieves lower classification error than objective perturbation for logistic regression. Non-private
logistic regression and support vector machines both have classification error about 0.15.

For the KDDCup99 dataset, the constant classifier that classifies all examples as negative has
error 0.19. Again, objective perturbation outperforms sensitivity for both logistic regression and
support vector machines; however, for SVM and high values of $\epsilon_p$ (low privacy), the sensitivity
method performs almost as well as objective perturbation. In the low privacy regime, logistic
regression under objective perturbation is better than support vector machines. In contrast, in
the high privacy regime (low $\epsilon_p$), support vector machines with objective perturbation outperform
logistic regression. For this dataset, non-private logistic regression and support vector machines
both have a classification error of about 0.001.

For SVMs on both Adult and KDDCup99, for large $\epsilon_p$ (0.25 onwards), the error of either of the
private methods can increase slightly with increasing $\epsilon_p$. This seems counterintuitive, but appears
to be due to the imbalance in the fraction of the two labels. As the labels are imbalanced, the optimal
classifier is trained to perform better on the negative labels than the positives. As $\epsilon_p$ increases, for
a fixed training data size, so does the perturbation from the optimal classifier, induced by either of
the private methods. Thus, as the perturbation increases, the number of false positives increases,
whereas the number of false negatives decreases (as we verified by measuring the average false
positive and false negative rates of the private classifiers). Therefore, the total error may increase
slightly with decreasing privacy.



7.3 Accuracy vs. Training Data Size Tradeoffs 

Next we examine how classification accuracy varies as we increase the size of the training set. We
measure classification accuracy as the accuracy of the classifier produced by the tuning procedure
in Section 6. As the Adult dataset is not sufficiently large to allow us to do privacy-preserving
tuning, for these experiments, we restrict our attention to the KDDCup99 dataset.

Figures 4 and 5 present the learning curves for objective perturbation, non-private ERM and
the sensitivity method for logistic loss and Huber loss, respectively. Experiments are shown for
$\epsilon_p = 0.01$ and $\epsilon_p = 0.05$ for both loss functions. The training sets (for each of 5 values of $\Lambda$) are
chosen to be of size $n$ = 60,000 to $n$ = 120,000, and the validation and test sets each are of size
25,000. Each presented value is an average over 5 random permutations of the data, and 50 runs
of the randomized classification procedure. For objective perturbation we performed experiments in
the regime when $\epsilon_p' > 0$, so $\Delta = 0$ in Algorithm 2.^4

For non-private ERM, we present results for training sets from $n$ = 300,000 to $n$ = 600,000.
The non-private algorithms are tuned by comparing 5 values of $\Lambda$ on the same training set, and
the test set is of size 25,000. Each reported value is an average over 5 random permutations of the
data.

We see from the figures that for non-private logistic regression and support vector machines,
the error remains constant with increasing data size. For the private methods, the error usually
decreases as the data size increases. In all cases, objective perturbation outperforms the sensitivity
method, and support vector machines generally outperform logistic regression.
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Figure 4: Learning curves for logistic regression on the KDDCup99 dataset 



^4 This was chosen for a fair comparison with non-private as well as the output perturbation method, both of which
had access to only 5 values of $\Lambda$.
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Figure 5: Learning curves for SVM on the KDDCup99 dataset 



8 Discussions and Conclusions 



In this paper we study the problem of learning classifiers with regularized empirical risk
minimization in a privacy-preserving manner. We consider privacy in the $\epsilon_p$-differential privacy model
of Dwork et al. (2006b) and provide two algorithms for privacy-preserving ERM. The first one is
based on the sensitivity method due to Dwork et al. (2006b), in which the output of the
non-private algorithm is perturbed by adding noise. We introduce a second algorithm based on the
new paradigm of objective perturbation. We provide bounds on the sample requirement of these
algorithms for achieving generalization error $\epsilon_g$. We show how to apply these algorithms with
kernels, and finally, we provide experiments with both algorithms on two real datasets. Our work is,
to our knowledge, the first to propose computationally efficient classification algorithms satisfying
differential privacy, together with validation on standard data sets.

In general, for classification, the error rate increases as the privacy requirements are made more 
stringent. Our generalization guarantees formalize this "price of privacy." Our experiments, as well 
as theoretical results, indicate that objective perturbation usually outperforms the sensitivity meth- 
ods at managing the tradeoff between privacy and learning performance. Both algorithms perform 
better with more training data, and when abundant training data is available, the performance of 
both algorithms can be close to non-private classification. 

The conditions on the loss function and regularizer required by output perturbation and
objective perturbation are somewhat different. As Theorem 1 shows, output perturbation requires
strong convexity in the regularizer and convexity as well as a bounded derivative condition in the
loss function. The last condition can be replaced by a Lipschitz condition instead. However, the
other two conditions appear to be required, unless we impose some further restrictions on the loss
and regularizer. Objective perturbation, on the other hand, requires strong convexity of the
regularizer, convexity, differentiability, and bounded double derivatives in the loss function. Sometimes, it
is possible to construct a differentiable approximation to the loss function, even if the loss function
is not itself differentiable, as shown in Section 3.4.2.

Our experimental as well as theoretical results indicate that in general, objective perturbation
provides more accurate solutions than output perturbation. Thus, if the loss function satisfies the
conditions of Theorem 2, we recommend using objective perturbation. In some situations, such
as for SVMs, it is possible that objective perturbation does not directly apply, but applies to an
approximation of the target loss function. In our experiments, the loss of statistical efficiency due
to such approximation has been small compared to the loss of efficiency due to privacy, and we
suspect that this is the case for many practical situations as well.

Finally, our work does not address the question of finding private solutions to regularized ERM
when the regularizer is not strongly convex. For example, neither the output perturbation nor the
objective perturbation method works for $L_1$-regularized ERM. However, in $L_1$-regularized ERM,
one can find a dataset in which a change in one training point can significantly change the solution.
As a result, it is possible that such problems are inherently difficult to solve privately.

An open question in this work is to extend objective perturbation methods to more general 
convex optimization problems. Currently, the objective perturbation method applies to strongly 
convex regularization functions and differentiable losses. Convex optimization problems appear in 
many contexts within and without machine learning: density estimation, resource allocation for 
communication systems and networking, social welfare optimization in economics, and elsewhere. 
In some cases these algorithms will also operate on sensitive or private data. Extending the ideas 
and analysis here to those settings would provide a rigorous foundation for privacy analysis. 

A second open question is to find a better solution for privacy-preserving classification with
kernels. Our current method is based on a reduction to the linear case, using the algorithm of
Rahimi and Recht (2008b); however, this method can be statistically inefficient, and require a lot
of training data, particularly when coupled with our privacy mechanism. The reason is that the
algorithm of Rahimi and Recht (2008b) requires the dimension $D$ of the projected space to be very
high for good performance. However, most differentially-private algorithms perform worse as the
dimensionality of the data grows. Is there a better linearization method, which is possibly
data-dependent, that will provide a more statistically efficient solution to privacy-preserving learning
with kernels?

A final question is to provide better upper and lower bounds on the sample requirement of 
privacy-preserving linear classification. The main open question here is to provide a computationally 
efficient algorithm for linear classification which has better statistical efficiency. 

Privacy-preserving machine learning is the endeavor of designing private analogues of widely
used machine learning algorithms. We believe the present study is a starting point for further
study of the differential privacy model in this relatively new subfield of machine learning. The
work of Dwork et al. set up a framework for assessing the privacy risks associated with
publishing the results of data analyses. Demanding high privacy requires sacrificing utility, which in
the context of classification and prediction is excess loss or regret. In this paper we demonstrate the
privacy-utility tradeoff for ERM, which is but one corner of the machine learning world. Applying
these privacy concepts to other machine learning problems will lead to new and interesting tradeoffs
and towards a set of tools for practical privacy-preserving learning and inference. We hope that
our work provides a benchmark of the current price of privacy, and inspires improvements in future
work.
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